09/29 2025
Comparing autonomous driving to the human brain and sensory system, data is the raw input from external perception, while annotation tells the brain 'what this is, where it is, and how it will move.' Without high-quality annotation, even the most advanced perception, tracking, and prediction models resemble a person deprived of food: theoretically capable of movement, but unable to sustain reliable work. The task of annotation is not simply to draw boxes around objects in an image, but to record ambiguous, overlapping, and transient real-world events in a clear, unified, machine-readable format for model training and evaluation. For autonomous vehicles, annotation determines what the system can learn, what it can see clearly, and where it is likely to err, which directly affects system safety and commercial viability.

The 'Quantity' and 'Quality' of Annotation: What Scale and Precision Are Needed?
To enable safe autonomous driving, a small number of annotated samples is not enough; only large-scale, multi-modal, multi-task annotation makes the investment in data pay off. Moreover, the required data scale and quality targets vary significantly with the stage and objectives of an autonomous driving program. For prototyping or proof-of-concept work, tens to hundreds of thousands of annotated frames are typically sufficient to train a basic model and support rapid iteration. To deploy functions in closed-road trials or limited-scenario operations, the data needs to grow to hundreds of thousands to millions of frames. Covering urban-scale, all-weather, and long-tail events requires pushing annotation scale to millions or even tens of millions of samples.

These 'frames' can refer to single camera images, LiDAR point cloud frames, or time-synchronized segments from multiple sensors. For camera images, common training set sizes range from hundreds of thousands to millions of labeled images. For point clouds, the typical range is tens of thousands to millions of frames, with the number of points per frame depending on the LiDAR type. Common production-grade sensors generate tens of thousands to hundreds of thousands of points per frame.
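To make the notion of a 'frame' concrete, here is a minimal sketch of what one time-synchronized sample might bundle; the field names, point counts, and array layouts are assumptions for illustration, not a reference to any particular dataset format.

```python
# Minimal sketch of a time-synchronized multi-sensor frame (illustrative schema).
from dataclasses import dataclass
import numpy as np

@dataclass
class SyncedFrame:
    timestamp_ns: int
    camera_images: dict[str, np.ndarray]    # camera name -> HxWx3 uint8 image
    lidar_points: np.ndarray                # Nx4 array: x, y, z, intensity
    ego_pose: np.ndarray                    # 4x4 vehicle-to-world transform

frame = SyncedFrame(
    timestamp_ns=1_727_600_000_000_000_000,
    camera_images={"front": np.zeros((1080, 1920, 3), dtype=np.uint8)},
    lidar_points=np.zeros((120_000, 4), dtype=np.float32),   # ~120k points/frame
    ego_pose=np.eye(4),
)
print(frame.lidar_points.shape)
```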
Several core metrics determine whether annotations are usable. The first is label consistency, often quantified through inter-annotator agreement or the IoU (Intersection over Union) distribution. For 2D detection tasks, at an IoU threshold of ≥0.5, a common consistency target is above 85%; for high-precision applications or small-object detection, maintaining over 70% consistency at IoU ≥0.7 is desirable. Pixel-level semantic and instance segmentation is labor-intensive, and because boundary noise propagates directly into localization and obstacle avoidance, stricter consistency thresholds are needed. 3D bounding boxes in point clouds have more degrees of freedom, so annotation errors show up more prominently; common metrics include box center error (at the centimeter level) and orientation error (in degrees). In real-world projects, center errors are typically expected to stay within 10–30 cm and orientation errors within a few to tens of degrees, depending on the operational safety boundaries.
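As a concrete illustration of the consistency metric, the sketch below (plain Python, with a hypothetical `[x1, y1, x2, y2]` box format) computes pairwise IoU between two annotators' 2D boxes and the fraction of boxes that agree above a threshold; the greedy matching and the function names are illustrative, not taken from any specific toolchain.

```python
# Minimal sketch: measuring inter-annotator agreement on 2D boxes.
# Boxes use a hypothetical [x1, y1, x2, y2] pixel format; thresholds mirror
# the targets discussed above (IoU >= 0.5 or >= 0.7).

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def agreement_rate(boxes_annotator_1, boxes_annotator_2, threshold=0.5):
    """Greedy one-to-one matching: fraction of annotator-1 boxes that find a
    counterpart from annotator 2 with IoU above the threshold."""
    unmatched = list(boxes_annotator_2)
    hits = 0
    for box in boxes_annotator_1:
        best = max(unmatched, key=lambda b: iou(box, b), default=None)
        if best is not None and iou(box, best) >= threshold:
            hits += 1
            unmatched.remove(best)
    return hits / len(boxes_annotator_1) if boxes_annotator_1 else 1.0

# Example: two annotators labeling the same image.
a1 = [[100, 100, 200, 220], [400, 150, 480, 260]]
a2 = [[105, 98, 198, 225], [390, 165, 465, 250]]
print(f"agreement @ IoU>=0.5: {agreement_rate(a1, a2, 0.5):.2f}")
print(f"agreement @ IoU>=0.7: {agreement_rate(a1, a2, 0.7):.2f}")
```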
Annotation efficiency can also be quantified. For 2D bounding box annotation or correction on top of automated pre-labeling, an experienced annotator can correct hundreds to thousands of images per day (assuming a low average number of targets per image). For pixel-level segmentation with proper tools and pre-labeling, an annotator can complete dozens to around a hundred high-quality images daily; without assistance, the speed drops sharply. Point cloud annotation is more time-consuming: with excellent tools and pre-labeling, an annotator can process dozens to around a hundred 3D bounding boxes or instance labels per day, and for detailed point-level semantic annotation or dense segmentation, daily output drops to a dozen or so frames. Scaling these numbers up to team size and calendar time, an initial annotation pass over millions of frames often requires tens to hundreds of annotators working in parallel for weeks to months, depending on pre-labeling quality and review depth.
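A back-of-envelope calculation, using illustrative midpoints of the throughput ranges above (the specific numbers are placeholders, not measurements), shows how these figures translate into team size and calendar time:

```python
# Back-of-envelope staffing estimate for an initial annotation pass.
# Throughput, team size, and review overhead are illustrative placeholders.

def annotator_days(total_frames, frames_per_annotator_per_day, review_overhead=0.2):
    """Person-days needed, inflated by a review/QA overhead fraction."""
    base = total_frames / frames_per_annotator_per_day
    return base * (1.0 + review_overhead)

total_frames = 2_000_000          # target initial pass, e.g. 2M camera frames
throughput = 600                  # corrected images per day with good pre-labeling
team_size = 80                    # annotators working in parallel

days = annotator_days(total_frames, throughput) / team_size
print(f"~{days:.0f} working days (~{days / 21:.1f} months) for the initial pass")
```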
Of course, the relationship between data volume and training effectiveness is not linear, but empirical data illustrates the phenomenon of 'diminishing marginal returns.' For a fixed model and task, expanding training samples from 100,000 to 300,000 typically yields significant performance improvements. Expanding from 300,000 to 1 million still shows noticeable gains but with a smaller margin. Pushing from 1 million to several million or tens of millions results in slower performance growth, with benefits often stemming from expanded scenario coverage or longer-tail event inclusion rather than baseline average precision improvements. Thus, with limited resources, balancing data scale, annotation granularity, and scenario diversity is a core challenge in designing data strategies.
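As a toy illustration of this diminishing-returns curve, the snippet below computes the marginal gain per additional 100k frames from a set of entirely hypothetical mAP figures; the values are invented to match the qualitative trend described above, not drawn from any benchmark.

```python
# Illustrative only: hypothetical mAP figures showing diminishing marginal
# returns as the labeled set grows. The values are made up for this sketch.

points = [
    (100_000, 0.52),
    (300_000, 0.61),
    (1_000_000, 0.66),
    (3_000_000, 0.68),
    (10_000_000, 0.69),
]

for (n0, m0), (n1, m1) in zip(points, points[1:]):
    gain_per_100k = (m1 - m0) / ((n1 - n0) / 100_000)
    print(f"{n0:>10,} -> {n1:>10,} frames: "
          f"+{m1 - m0:.2f} mAP total, {gain_per_100k:+.4f} per extra 100k frames")
```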

How to Leverage Tools, Processes, and Semi-Automation to Reduce Costs and Ensure Quality
Treating data annotation as an engineering task requires clear processes, user-friendly tools, and continuous quality control. Annotation platforms must display multi-modal data simultaneously (synchronized camera + point cloud + trajectory), and support timeline playback, cross-frame ID tracking and editing, batch operations, and import of automated pre-labels. Effective pre-labeling can reduce manual workload by 30%–70%, depending on the model's initial capability and task complexity. For example, in vehicle and pedestrian detection tasks, adding a basic detection model to the pre-labeling stage significantly reduces the fraction of targets per frame whose position and category need manual correction, cutting per-frame manual effort from minutes to tens of seconds or less.
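A minimal pre-labeling sketch along these lines, assuming an off-the-shelf torchvision detector and an illustrative JSON layout for the draft labels (neither reflects any particular production pipeline):

```python
# Pre-labeling sketch: run an off-the-shelf detector and dump its
# high-confidence boxes as draft labels for human correction.
import json
import torch
import torchvision
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def prelabel(image_path, score_threshold=0.6):
    image = convert_image_dtype(read_image(image_path), torch.float)
    with torch.no_grad():
        pred = model([image])[0]           # dict with 'boxes', 'labels', 'scores'
    keep = pred["scores"] >= score_threshold
    return {
        "image": image_path,
        "prelabels": [
            {"box": box.tolist(), "label": int(label), "score": float(score)}
            for box, label, score in zip(
                pred["boxes"][keep], pred["labels"][keep], pred["scores"][keep]
            )
        ],
        "status": "needs_review",          # annotators correct rather than draw
    }

# print(json.dumps(prelabel("frame_000123.jpg"), indent=2))
```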

In process design, detailed annotation specifications outweigh short-term speed optimizations. Specifications must clarify ambiguous boundaries, such as how to draw boxes during occlusion, how to label uncertain behaviors, and how to handle cross-category boundaries (e.g., distinguishing electric scooters from pedestrians). Specifications should also include extensive example and counterexample libraries to reduce annotator decision costs in ambiguous areas. Quality control typically involves two layers: automated quality checks and manual sampling. Automated checks detect obvious issues like labels exceeding image boundaries, mismatched categories and scenes, or sudden ID changes in the timeline. Manual sampling verifies semantic issues that automated checks cannot cover, such as long-term behavior annotation and complex interaction judgments.
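The automated layer of that quality control can be as simple as structural checks that run over every label file before human sampling; the sketch below assumes a 2D-box record with a per-object `track_id` and `category`, which is an illustrative schema rather than a standard one.

```python
# Sketch of automated QC: cheap structural checks run on every label file
# before manual sampling. The record layout is an assumption for illustration.

def check_frame(labels, image_width, image_height, allowed_classes):
    """Return a list of human-readable issues found in one frame's labels."""
    issues = []
    for obj in labels:
        x1, y1, x2, y2 = obj["box"]
        if x1 < 0 or y1 < 0 or x2 > image_width or y2 > image_height:
            issues.append(f"track {obj['track_id']}: box outside image bounds")
        if x2 <= x1 or y2 <= y1:
            issues.append(f"track {obj['track_id']}: degenerate box")
        if obj["category"] not in allowed_classes:
            issues.append(f"track {obj['track_id']}: unknown category {obj['category']!r}")
    return issues

def check_track_continuity(frames):
    """Flag track IDs that disappear and reappear, a common sign of ID switches."""
    last_seen = {}
    issues = []
    for idx, labels in enumerate(frames):
        for obj in labels:
            tid = obj["track_id"]
            if tid in last_seen and idx - last_seen[tid] > 1:
                issues.append(f"track {tid}: gap between frame {last_seen[tid]} and {idx}")
            last_seen[tid] = idx
    return issues
```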
Semi-automation and active learning are powerful tools for improving annotation efficiency. By using model uncertainty as a sampling criterion, annotation resources can be prioritized for data most valuable to the model. Active learning strategies often reduce the required annotation volume by 20%–50% while achieving performance close to full annotation, saving time and costs. However, active learning effectiveness strongly depends on evaluation metrics and sampling strategies. Blind application may concentrate resources on narrow areas where the model is 'confused,' neglecting long-tail scenarios. Thus, integrating active learning into an iterative process and fine-tuning parameters with engineering experience is essential.
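A minimal uncertainty-sampling sketch, assuming detection confidence scores from a prior inference pass and an illustrative annotation budget:

```python
# Active-learning selection sketch: rank unlabeled frames by model uncertainty
# (least-confidence on the most confident detection) and send only the top
# slice to annotators. Scores and budget are illustrative assumptions.

def least_confidence(frame_scores):
    """Uncertainty of a frame given its detections' confidence scores."""
    if not frame_scores:
        return 1.0                      # nothing detected: maximally uncertain
    return 1.0 - max(frame_scores)

def select_for_annotation(unlabeled, budget_fraction=0.2):
    """unlabeled: list of (frame_id, [detection scores]) pairs."""
    ranked = sorted(unlabeled, key=lambda item: least_confidence(item[1]), reverse=True)
    budget = max(1, int(len(ranked) * budget_fraction))
    return [frame_id for frame_id, _ in ranked[:budget]]

pool = [("f001", [0.97, 0.91]), ("f002", [0.55]), ("f003", []), ("f004", [0.72, 0.60])]
print(select_for_annotation(pool, budget_fraction=0.5))   # -> ['f003', 'f002']
```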
When evaluating annotation ROI, consider both direct costs (manual labor, outsourcing fees) and indirect costs (storage, version management, re-annotation, privacy compliance). Pixel-level segmentation and point-level annotation carry much higher per-unit time costs than 2D bounding boxes, and re-annotating them is expensive as well. Absent an explicit business need, keeping annotation granularity 'sufficient but not redundant' is therefore a sound optimization path. Many teams start with 2D bounding boxes to build a foundation quickly, then upgrade key scenarios or objects to pixel-level or point-level precision, focusing resources on the parts that strengthen the system's safety boundaries.
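To make the granularity trade-off tangible, here is a rough cost comparison for the same 100k frames; the per-frame effort and hourly rate are pure placeholders, chosen to be consistent with the throughput ranges discussed earlier:

```python
# Rough cost comparison across annotation granularities for 100k frames.
# Per-frame minutes and the hourly rate are illustrative placeholders.

GRANULARITY_MINUTES_PER_FRAME = {
    "2d_boxes_with_prelabels": 0.7,
    "pixel_segmentation": 8.0,
    "point_level_3d": 12.0,
}
HOURLY_RATE = 12.0                  # placeholder cost per annotator-hour

frames = 100_000
for granularity, minutes in GRANULARITY_MINUTES_PER_FRAME.items():
    hours = frames * minutes / 60
    print(f"{granularity:>26}: ~{hours:,.0f} annotator-hours, ~${hours * HOURLY_RATE:,.0f}")
```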

Data-Driven Annotation Decision-Making
Annotation is not a one-time project but an ongoing operational challenge. As models update, business scenarios expand, and regulations evolve, label specifications and dataset versions change, and a robust data governance system minimizes the cost of those changes. To achieve this, first establish label ontology management, with clear definitions and counterexample sets for each category, subclass, and semantic hierarchy, so that annotators can quickly resolve borderline cases against the specification. Second, implement dataset version management and traceable change records: when a label specification is updated, the system must track which samples were re-annotated, who made the changes, and the before-and-after difference metrics. This makes it possible to quickly determine whether a label change caused model degradation or abnormal behavior, and to roll back or correct it.
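A minimal sketch of what a label-ontology entry and a traceable re-annotation record might look like, assuming a simple in-house schema whose field names are purely illustrative:

```python
# Illustrative ontology entry and change record; not a standard schema.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class LabelClass:
    name: str                      # e.g. "pedestrian"
    parent: Optional[str]          # semantic hierarchy, e.g. "vulnerable_road_user"
    definition: str
    counterexamples: list[str] = field(default_factory=list)

@dataclass
class ReannotationRecord:
    sample_id: str
    spec_version_from: str
    spec_version_to: str
    changed_by: str
    changed_at: datetime
    diff_summary: dict             # e.g. {"boxes_added": 1, "classes_changed": 2}

pedestrian = LabelClass(
    name="pedestrian",
    parent="vulnerable_road_user",
    definition="Person on foot, including standing or pushing a stroller.",
    counterexamples=["person riding an e-scooter -> label as 'rider'"],
)
record = ReannotationRecord(
    sample_id="clip_0042/frame_0117",
    spec_version_from="v2.3",
    spec_version_to="v2.4",
    changed_by="annotator_17",
    changed_at=datetime(2025, 9, 29, 10, 30),
    diff_summary={"classes_changed": 1},
)
```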
Long-term maintenance requires closing the loop between model performance feedback and the annotation system. Prioritize annotating model misclassifications, low-confidence samples, and real-world operational alerts, as these data points often improve system robustness more than random sampling. In most practices, prioritizing and re-annotating operational error samples for training is the most efficient way to enhance system performance in critical scenarios. Simultaneously, establishing periodic quality reviews (e.g., monthly) clarifies ambiguous points in annotation specifications and converts annotator questions into specification improvements or example library expansions.

Synthetic and simulated data are effective for long-tail scenarios but cannot replace real annotations. Simulation efficiently generates extreme weather, rare accidents, or high-risk interaction samples, and is particularly valuable when real-world collection is costly or dangerous. A common practice is to use synthetic data for pre-training or for reinforcing specific model or policy modules, followed by real-world data for domain adaptation and calibration. Crucially, quantify the domain gap when using synthetic data and validate against real data in a closed loop.
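One crude way to put a number on that domain gap is to compare the distribution of some per-image statistic between synthetic and real frames; the sketch below uses randomly generated placeholder statistics and a 1-D Wasserstein distance purely for illustration (FID-style metrics on learned features are more common in practice).

```python
# Toy domain-gap check: compare the distribution of a per-image statistic
# (stand-in for a learned feature) between real and synthetic frames.
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
real_stats = rng.normal(loc=0.45, scale=0.08, size=5_000)       # placeholder data
synthetic_stats = rng.normal(loc=0.52, scale=0.05, size=5_000)  # placeholder data

gap = wasserstein_distance(real_stats, synthetic_stats)
print(f"1-D Wasserstein distance between real and synthetic statistic: {gap:.3f}")
```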
Privacy and compliance are additional challenges requiring data-driven management. Road imagery often contains sensitive information like faces and license plates. Annotation processes must implement automatic blurring and anonymization at the collection or annotation stage, maintaining audit trails to meet regulatory or contractual requirements. These protections incur additional computational and storage costs and may impact algorithm performance in appearance-based behavior classification. Thus, privacy compliance should be factored into cost budgets and technical solutions from project inception.
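A minimal anonymization sketch, assuming the face and license-plate regions come from a separate upstream detector and boxes use a hypothetical `[x1, y1, x2, y2]` format:

```python
# Anonymization sketch: blur sensitive regions in place using OpenCV.
import cv2

def anonymize(image, sensitive_boxes, kernel=(51, 51)):
    """Blur each sensitive region of a BGR image array and return the image."""
    for x1, y1, x2, y2 in sensitive_boxes:
        region = image[y1:y2, x1:x2]
        image[y1:y2, x1:x2] = cv2.GaussianBlur(region, kernel, 0)
    return image

# frame = cv2.imread("frame_000123.jpg")
# frame = anonymize(frame, [[320, 180, 380, 240]])   # e.g. a detected face box
# cv2.imwrite("frame_000123_anon.jpg", frame)
```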
Data strategies should vary by team scale and objectives. Resource-constrained startups should focus annotations on critical scenarios and categories, first establishing reusable annotation pipelines and specifications before expanding sample volumes. Large teams or automakers can afford to develop in-house annotation platforms, train dedicated auto-annotation models, and conduct large-scale data governance but must still prioritize tool usability and process efficiency to avoid prohibitive maintenance costs. Regardless of scale, treating data as a product and annotation as a long-term engineering investment is essential for transitioning autonomous driving from labs to real roads.

Final Thoughts
Viewing annotation as 'just another step in data engineering' undervalues its significance. Annotation is a core engineering challenge for deploying autonomous driving safely: it determines what view of the world the model learns, where it may err, and where human intervention is most critical. By sizing annotation scale, granularity, and quality against quantitative metrics, and by combining tooling, semi-automation, and active learning for efficiency, teams can maximize the value of their data at controllable cost.
-- END --