3D LiDAR point cloud semantic and noise point recognition
Problem Context
3D LiDAR point cloud data can be viewed as a set of triplets, where each triplet gives the x, y, and z position of a point in 3D space. To make this data useful for guiding autonomous driving, we need algorithms that analyze the points so the driving system can plan and act accordingly. Compared with 2D image computer vision, 3D data is more challenging because of its high sparsity. Depending on the task, several families of point cloud algorithms may be used.
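To make this representation concrete, a single scan is usually held in memory as an (N, 3) array, one xyz triplet per LiDAR return. The short sketch below assumes a binary file laid out as x, y, z, intensity in float32, which is a common but not universal convention; the file name is a placeholder.

import numpy as np

def load_scan(path: str) -> np.ndarray:
    """Load a binary LiDAR scan and return an (N, 3) array of xyz positions."""
    raw = np.fromfile(path, dtype=np.float32).reshape(-1, 4)  # columns: x, y, z, intensity
    return raw[:, :3]                                         # keep only the xyz triplets

# points = load_scan("scan_000000.bin")   # hypothetical file
# print(points.shape)                     # (N, 3)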
For simple global classification, we may not need to maintain spatial information, so a shared function approximator can be applied to each point independently without modeling the relations between points. Another approach is to voxelize the 3D space, merging the points that fall into the same grid cell into a single embedding; with this representation, traditional convolution can be applied to the voxels. Although this method seems efficient, the sparsity of point cloud data means the majority of voxels in the space are empty, which wastes computing resources.
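As a minimal illustration of the voxel approach (not taken from any particular system), the sketch below groups points into a regular grid and averages the points that land in each occupied cell. The voxel size and the choice of a mean-xyz feature are arbitrary placeholders; real pipelines typically use richer per-voxel features.

import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.2):
    """Group points into voxels and return occupied voxel coords plus a mean-xyz feature per voxel."""
    coords = np.floor(points / voxel_size).astype(np.int64)          # (N, 3) integer voxel indices
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)   # only occupied voxels are kept
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(uniq)).astype(np.float32)
    feats = np.zeros((len(uniq), 3), dtype=np.float32)
    for d in range(3):                                               # scatter-mean of xyz per voxel
        feats[:, d] = np.bincount(inverse, weights=points[:, d], minlength=len(uniq))
    feats /= counts[:, None]
    return uniq, feats

Note that only occupied voxels are materialized here; a dense grid built from the same scan would be dominated by empty cells, which is exactly the waste described above.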
To address this problem, researchers have proposed submanifold sparse convolution, which performs convolution only at positions with non-zero values. There are also algorithms that combine voxels and points to retain as much spatial and fine-grained information as possible, such as SPVNAS. Since the computing resources on autonomous vehicles are quite limited, it is crucial to design algorithms efficiently to achieve the best trade-off between accuracy and frames per second (FPS).
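Production implementations of submanifold sparse convolution rely on libraries with hashed coordinate maps and GPU kernels (for example spconv or TorchSparse), but the core idea can be sketched in a few lines of plain Python. This is a conceptual toy under the stated assumptions, not a deployable implementation: features live only at active voxels, the output keeps exactly the input's sparsity pattern, and empty neighbours are skipped entirely.

import numpy as np

def submanifold_conv3d(active: dict, weights: np.ndarray) -> dict:
    """Toy submanifold sparse 3x3x3 convolution, stride 1.

    `active` maps integer voxel coords (x, y, z) -> feature vector of size C_in.
    `weights` has shape (3, 3, 3, C_in, C_out).
    Output sites are exactly the input's active sites, so sparsity never dilates.
    """
    out = {}
    for (x, y, z), feat in active.items():
        acc = np.zeros(weights.shape[-1], dtype=np.float32)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    nbr = active.get((x + dx, y + dy, z + dz))
                    if nbr is not None:                       # empty voxels contribute nothing
                        acc += nbr @ weights[dx + 1, dy + 1, dz + 1]
        out[(x, y, z)] = acc
    return out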
Implementation
The first step in building a point cloud recognition system is to label the data. Different tasks have different labeling requirements, such as drawing bounding boxes for detection or labeling every point for semantic segmentation. Next, we can build our deep learning model on top of the many open-source architectures that have been proven effective in lab settings and published; typically, we only need to adjust some hyperparameters to fit our specific problem.

Once training yields an acceptable model, the next step is to deploy it into production. This step is essential because moving the computation from Python to C++ often cuts memory cost and computing time by more than 50%. Public tools such as ONNX are useful for this purpose, and for NVIDIA devices, an extra step of converting the ONNX model into a TensorRT engine can accelerate computation further. A major difficulty in producing deployment-ready models is writing custom operation plugins when the model contains operations that these frameworks do not yet support (standard layers like convolution and linear are mostly covered, but submanifold sparse convolution usually is not).
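As a hedged sketch of that export step, assuming a trained PyTorch model (the model class, input shape, and file names below are placeholders), torch.onnx.export produces the ONNX file, and NVIDIA's stock trtexec tool can then build a TensorRT engine from it. A model containing unsupported ops such as submanifold sparse convolution would first need a custom plugin or an op rewrite.

import torch

model = MySegmentationNet().eval()        # hypothetical trained model
dummy = torch.randn(1, 64, 480, 480)      # example input matching the model's expected shape

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["voxels"],
    output_names=["logits"],
    opset_version=13,
)

# On an NVIDIA device, the ONNX file can then be compiled into a TensorRT engine:
#   trtexec --onnx=model.onnx --saveEngine=model.engine --fp16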
Although researchers have published many high-performing models, these models can be too complex for industry applications, where the requirements are more specific. To meet such needs, it is often more appropriate to draw inspiration from academic work and then develop algorithms tailored to the customized requirements.