12/01 2025
In the realm of autonomous driving, point clouds serve as a critical perception signal. For instance, the point cloud generated by LiDAR (Light Detection and Ranging) comprises a multitude of points distributed in three-dimensional space, each characterized by coordinates, intensity, and a timestamp. A single frame of point cloud data resembles a scene where "stars are scattered across the ground." Unlike traditional images, point clouds lack a structured pixel grid, possess no inherent color (unless fused with camera data), and contain no direct semantic labels.
For machines to effectively "learn" from point clouds, they must not only convert the input into a numerical structure that models can process but also learn to map human-defined semantics (such as "this is a pedestrian" or "this is a lane") onto these numerical values.
The process of transforming point clouds into a format understandable by models involves several steps, including cleaning, encoding, labeling, and more. These steps convert unstructured geometric information into structured tensors and supervised signals. Subsequently, appropriate networks or algorithms are employed to learn these mapping relationships.

How Are Point Clouds Converted into a Form That Models Can Learn From?
The raw point cloud data collected by autonomous vehicles cannot be used directly, owing to isolated noise points, ground reflections, environmental factors such as rain, snow, and fog, and scan distortion caused by vehicle motion. To make point clouds usable by a model, preprocessing is essential: the goal is to normalize the raw point cloud, suppress interference, and retain discriminative information. Typical preprocessing steps include time synchronization and motion-distortion compensation (aligning points to a unified coordinate system), ground removal and denoising (separating ground and planar noise so objects segment more cleanly), and downsampling (reducing the point count via voxelization or random sampling to lighten subsequent computation). These operations are crucial for the stability of model training and inference.
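For concreteness, here is a minimal NumPy sketch of two of these steps, voxel downsampling and a crude height-threshold ground filter. The 0.2 m voxel size and the -1.4 m threshold are illustrative assumptions, not values from this article, and real pipelines use more sophisticated ground segmentation.

```python
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.2) -> np.ndarray:
    """Keep one representative point (the centroid) per occupied voxel.

    points: (N, 4) array of x, y, z, intensity.
    """
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    # Group points by voxel index and average each group.
    _, inverse, counts = np.unique(coords, axis=0, return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

def remove_ground_by_height(points: np.ndarray, z_threshold: float = -1.4) -> np.ndarray:
    """Crude ground removal: drop points below a height threshold in the sensor frame."""
    return points[points[:, 2] > z_threshold]
```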
Neural networks typically prefer regular tensors, whereas point clouds are unordered sets. Therefore, addressing the "representation" problem of point clouds is essential. Mainstream approaches include directly using point sets as input and designing point-level networks, discretizing point clouds into voxel grids and employing sparse convolutions, projecting point clouds into 2D images (such as bird's-eye view (BEV) or range images) and using 2D convolutions, and adopting hybrid methods that combine different representations. Different methods influence the model's computational complexity, memory usage, and retention of details. For example, point-level methods retain details well but are computationally expensive, while BEV is fast and planning-friendly but loses height information.
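To make the BEV idea concrete, the sketch below rasterizes a point cloud into a two-channel pseudo-image (maximum height and point density per cell) that a 2D CNN could consume. The detection range and 0.1 m cell size are illustrative assumptions.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.1):
    """Rasterize a point cloud into a bird's-eye-view grid of max height and point count."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]
    ix = ((x - x_range[0]) / cell).astype(np.int64)
    iy = ((y - y_range[0]) / cell).astype(np.int64)
    h = int((x_range[1] - x_range[0]) / cell)
    w = int((y_range[1] - y_range[0]) / cell)
    height_map = np.full((h, w), -np.inf)
    np.maximum.at(height_map, (ix, iy), z)      # tallest point per cell
    density = np.zeros((h, w))
    np.add.at(density, (ix, iy), 1.0)           # number of points per cell
    height_map[np.isinf(height_map)] = 0.0
    return np.stack([height_map, density], axis=0)  # (2, H, W) pseudo-image for 2D CNNs
```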
In addition to spatial representation, engineering features such as each point's echo intensity, local density, normal vector estimation, and height from the nearest ground can be extracted and used alongside coordinates as network inputs to assist the model in learning richer geometric cues.
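As an illustration, the following sketch (using NumPy and SciPy's cKDTree, with an assumed 0.5 m neighborhood radius and a deliberately crude ground estimate) attaches a local-density feature and a height-above-ground feature to each point.

```python
import numpy as np
from scipy.spatial import cKDTree

def per_point_features(points, radius=0.5):
    """Attach simple hand-crafted features to each point: local density and height above ground."""
    xyz = points[:, :3]
    tree = cKDTree(xyz)
    # Local density: number of neighbours within `radius` metres of each point.
    density = np.array([len(idx) for idx in tree.query_ball_point(xyz, r=radius)])
    # Height above ground, approximated here by the lowest z value in the frame.
    height = xyz[:, 2] - xyz[:, 2].min()
    intensity = points[:, 3]
    return np.column_stack([xyz, intensity, density, height])
```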

How Do Models Learn Content from Point Clouds?
For a model to learn "what" and "where," it must first be trained on labeled data. Labeling means assigning each point a category (such as road, car, or pedestrian), enclosing cars or pedestrians with 3D bounding boxes, and sometimes even annotating their speeds and trajectories. Labeling quality directly determines the model's performance ceiling, so semi-automatic labeling tools, projection onto camera images for auxiliary labeling, and synthetic data are often used to assist the process.
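The exact label format varies by dataset and tooling; the dataclasses below are a hypothetical but typical layout for one annotated frame, carrying per-point segmentation labels and 3D boxes.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box3D:
    """One labelled object: centre, size, heading, and category."""
    center: np.ndarray   # (3,) x, y, z of the box centre
    size: np.ndarray     # (3,) length, width, height
    yaw: float           # rotation around the vertical axis, in radians
    label: str           # e.g. "car", "pedestrian"

@dataclass
class LabeledFrame:
    """One annotated LiDAR frame used as a training sample."""
    points: np.ndarray          # (N, 4) x, y, z, intensity
    point_labels: np.ndarray    # (N,) per-point class id for segmentation
    boxes: list[Box3D]          # 3D bounding boxes for detection
```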
When training models, it is crucial not only to teach the model to recognize objects (e.g., "this is a car," "this is a pedestrian," "this is a road") but also to instruct it on how to enclose objects with 3D bounding boxes, indicating their location and size. Moreover, the model must learn to judge the orientation of the boxes and how well they align with real objects.
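One common way detectors learn position, size, and orientation is to regress offsets relative to an anchor box, encoding the heading through sine and cosine so the loss does not jump at the angle wrap-around. The sketch below shows such an encoding; the exact parameterization differs between detectors, so treat the details as an illustrative assumption.

```python
import numpy as np

def encode_box_target(gt_box, anchor):
    """Encode a ground-truth box relative to an anchor, as is common in 3D detectors.

    gt_box, anchor: (x, y, z, l, w, h, yaw)
    """
    xa, ya, za, la, wa, ha, ra = anchor
    xg, yg, zg, lg, wg, hg, rg = gt_box
    diag = np.sqrt(la**2 + wa**2)
    return np.array([
        (xg - xa) / diag,      # centre offsets, normalised by anchor footprint
        (yg - ya) / diag,
        (zg - za) / ha,
        np.log(lg / la),       # size ratios in log space
        np.log(wg / wa),
        np.log(hg / ha),
        np.sin(rg - ra),       # heading residual as sine/cosine avoids the 2*pi wrap-around
        np.cos(rg - ra),
    ])
```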
Furthermore, to improve robustness, long-tail edge cases must be brought into training. The sampling frequency of different scene types can be adjusted so the model learns not only common objects but also rarer, more complex scenarios. The model can also be designed to report a confidence for each prediction, so the downstream system knows how much to trust its output when making decisions.
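One simple way to adjust scene frequency is inverse-frequency sampling. The sketch below (with a hypothetical `rare_boost` cap) turns per-frame scene tags into draw probabilities that favor rare scenes during training.

```python
import numpy as np

def scene_sampling_weights(scene_tags, rare_boost=4.0):
    """Give long-tail scene types a higher chance of being drawn during training."""
    tags, counts = np.unique(scene_tags, return_counts=True)
    freq = dict(zip(tags, counts / counts.sum()))
    # Inverse-frequency weighting, capped so rare scenes are at most `rare_boost` times as likely.
    weights = np.array([min(rare_boost, 1.0 / (freq[t] * len(tags))) for t in scene_tags])
    return weights / weights.sum()
```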
Data augmentation is also crucial during model training. It involves slightly modifying existing data to increase diversity. This can be achieved by randomly rotating, scaling, or translating the point cloud, intentionally deleting some points, adding noise, or copying real objects from one scene to another. These techniques enable the model to perform better in situations involving occlusion, sparse points, or distant targets.
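A minimal NumPy sketch of the augmentations described above; the rotation range, scale range, jitter magnitude, and 10% drop rate are illustrative values, not tuned settings.

```python
import numpy as np

def augment_point_cloud(points, rng=np.random.default_rng()):
    """Apply a few common point cloud augmentations: rotate, scale, jitter, drop points."""
    xyz = points[:, :3].copy()
    # Random rotation around the vertical (z) axis.
    angle = rng.uniform(-np.pi / 4, np.pi / 4)
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    xyz = xyz @ rot.T
    # Random global scaling and small per-point jitter.
    xyz *= rng.uniform(0.95, 1.05)
    xyz += rng.normal(scale=0.01, size=xyz.shape)
    # Randomly drop a fraction of points to imitate occlusion and sparsity.
    keep = rng.random(len(xyz)) > 0.1
    out = points[keep].copy()
    out[:, :3] = xyz[keep]
    return out
```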
Self-supervised or contrastive learning has recently gained popularity. This approach involves initially training the model on a large amount of unlabeled point cloud data to learn "shape and structure understanding" abilities, followed by fine-tuning with a small amount of human-labeled data to achieve optimal results. This significantly reduces the workload associated with manual labeling.
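Contrastive pretraining commonly uses an InfoNCE-style objective that pulls together the embeddings of two augmented views of the same cloud and pushes apart views of different clouds. Below is a small NumPy version for illustration only; a production system would implement this in a deep learning framework with a learned point cloud encoder.

```python
import numpy as np

def info_nce_loss(emb_a, emb_b, temperature=0.07):
    """Contrastive (InfoNCE) loss between embeddings of two augmented views of the same clouds.

    emb_a, emb_b: (B, D) per-sample embeddings; row i of each is a positive pair.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # similarity of every view-a sample with every view-b sample
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching (diagonal) pairs should receive the highest similarity.
    return -np.mean(np.diag(log_prob))
```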

How Are Models Deployed After Training?
Training a model is merely the first step; significant work remains to deploy it safely in vehicles. Initially, strict evaluation metrics are required to measure the model's performance in detection, segmentation, and tracking. Detection tasks are commonly evaluated using mean average precision (mAP) and 3D bounding box overlap, while segmentation tasks often utilize mIoU (mean Intersection over Union). These metrics facilitate the comparison of performance across different models and training configurations. However, offline metrics alone are insufficient; scenario coverage testing and stress testing in more realistic environments are necessary to ensure acceptable performance in rare or extreme situations.
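For intuition, the snippet below computes IoU for two axis-aligned 3D boxes. Real detection benchmarks use rotated boxes and aggregate IoU-thresholded matches into mAP, so this is only the core overlap computation, simplified for illustration.

```python
import numpy as np

def axis_aligned_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (x_min, y_min, z_min, x_max, y_max, z_max)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0.0, None))    # overlap volume, zero if boxes are disjoint
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return inter / (vol_a + vol_b - inter)
```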
For deployment, the trained model must be compressed and optimized to fit the vehicle's compute and latency budget, typically through model pruning, weight quantization, and sparse computation libraries. It is crucial that the accuracy loss introduced by compression stays within acceptable bounds. After deployment, A/B testing and online regression monitoring are needed to continuously observe the model's performance on real roads; if degradation or anomalies appear, a swift rollback to the previous stable version is required.
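As a rough illustration of pruning and quantization in PyTorch, using a toy stand-in model rather than a real perception network:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A stand-in detection head; a real perception model would be far larger.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 16))

# Magnitude pruning: zero out the 30% smallest weights of each Linear layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# Post-training dynamic quantization: Linear weights stored as int8, activations
# quantized on the fly, trading a small accuracy loss for size and CPU latency.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```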
The data environment during training (sensor type, mounting height, urban layout, and so on) may differ from the model's actual operating environment. To mitigate the performance degradation caused by these differences, domain adaptation, data augmentation, or periodic online fine-tuning is needed. For long-running fleets, it is essential to establish a closed data loop: collect abnormal scenarios online, annotate them, and retrain periodically so the model's adaptability to the real world keeps improving.

Final Remarks
To enable models to learn from point clouds, the data must first be cleaned and organized into a format models can process, and human semantics must then be imparted through labels and training objectives. With reasonable representations, augmentation, and training strategies, the model learns to recognize categories, localize objects, and estimate uncertainty. Once these capabilities are engineered into the vehicle, backed by a continuous closed data loop and safety mechanisms, autonomous vehicles can operate safely and reliably in complex real-world environments using point cloud data.
-- END --