Why Doesn’t Map-Free Intelligent Driving Rely on SLAM for Local Semantic Mapping?

04/29/2026

The concept of map-free intelligent driving has firmly established itself within the autonomous driving sector. Traditionally, autonomous vehicles have leaned heavily on high-precision maps. However, there is now a stronger emphasis on achieving human-like driving capabilities, which involves real-time perception and understanding of the environment without depending on pre-existing maps.

In this evolving landscape, the integration of BEV (Bird's Eye View), Occupancy Networks, and Transformer models has emerged as the dominant approach. Meanwhile, SLAM (Simultaneous Localization and Mapping), a technique that once made significant strides in robotics, has not seen comparable success in intelligent driving. Why, then, does map-free intelligent driving not utilize SLAM for constructing local semantic maps?

Why Does Traditional Geometric Mapping Struggle to Adapt?

The fundamental principle of traditional SLAM solutions hinges on geometric constraints. These solutions rely on matching feature points, such as the edges of roadside structures and the corners of traffic signs, which are extracted by the system. Triangulation across views is then used to compute the three-dimensional coordinates of these points. While this method excels in static and rigid settings, it encounters substantial difficulties in dynamic and non-rigid urban traffic scenarios.
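To make those geometric constraints concrete, here is a minimal sketch (my own illustration, not from the article) of linear triangulation: two calibrated cameras observe the same feature point, and its 3-D position is recovered by solving a small homogeneous system. The camera matrices and the point are invented for the example.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    # Linear (DLT) triangulation: each view contributes two rows
    # of the homogeneous system A @ X = 0; the null vector of A
    # (smallest singular vector) is the homogeneous 3-D point.
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two identity-intrinsics cameras; the second is shifted 1 m sideways.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.5, 0.2, 4.0])          # a feature point 4 m ahead
x1 = P1 @ np.append(X_true, 1.0)
x1 = x1[:2] / x1[2]                          # pixel in camera 1
x2 = P2 @ np.append(X_true, 1.0)
x2 = x2[:2] / x2[2]                          # pixel in camera 2

X_est = triangulate(P1, P2, x1, x2)          # recovers X_true
```

With perfect observations the point is recovered exactly; the trouble in real traffic is that moving, non-rigid objects violate the static-scene assumption behind these constraints.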

When it comes to building local semantic maps, the SLAM approach essentially resembles assembling a jigsaw puzzle. It starts by identifying vehicles, pedestrians, and curbs in images, then attempts to project these semantically labeled objects into a map coordinate system. However, any occlusions in the images or minor deviations in camera angles due to vehicle movements can lead to misaligned geometric projections. This can result in duplicated objects or positional drifts on the map. Moreover, the computational load is uneven: as environmental complexity increases, maintaining a detailed local feature map consumes significant memory and processing time.
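The "minor deviations" above add up quickly. The toy calculation below (an illustration I added, not from the article) projects a detection 30 m ahead into map coordinates with and without a 0.5-degree heading error; the pose and detection values are invented.

```python
import numpy as np

def project_to_map(ego_pose, detection_xy):
    """Place a detection given in the vehicle frame into map coordinates."""
    x, y, yaw = ego_pose
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s],
                  [s,  c]])                   # 2-D rotation by the heading
    return np.array([x, y]) + R @ detection_xy

detection = np.array([30.0, 0.0])             # a sign 30 m ahead of the car
true_pose = (0.0, 0.0, 0.0)
drifted_pose = (0.0, 0.0, np.deg2rad(0.5))    # 0.5 deg heading error

p_true = project_to_map(true_pose, detection)
p_drift = project_to_map(drifted_pose, detection)
error = np.linalg.norm(p_drift - p_true)      # roughly a quarter metre
```

Even half a degree of heading drift displaces a 30 m detection by about 26 cm on the map; accumulated over frames, that is exactly the duplication and drift described above.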

Additionally, semantic discontinuity poses an unavoidable challenge. Traditional semantic mapping solutions require the system to comprehend objects before placing them on the map. Yet, in real-world driving scenarios, we encounter numerous unclassifiable objects, such as overhanging tree branches, construction debris, or uniquely shaped vehicles. If the SLAM solution fails to accurately label these objects, they may be omitted from the local map, thereby posing a substantial safety risk for autonomous driving.

How Does the Transformer Model Revolutionize Spatial Perception?

The ascendancy of the BEV solution in autonomous driving is largely attributed to its incorporation of the Transformer architecture. This architecture is adept at managing global associations and has fundamentally transformed the way spatial features are processed. Traditional methods for converting 2D images to 3D space rely on depth estimation, which involves estimating the distance of each pixel and then projecting it outward. However, depth estimation is inherently unstable and susceptible to interference from lighting conditions, shadows, rain, and fog.

The Transformer model introduces an active querying mechanism. In the BEV space, the algorithm initially sets up an empty bird's-eye view canvas. Each position on this canvas, referred to as a Query, actively queries all camera views to determine if any pixel information within their field of view corresponds to its geographical location. This mechanism eliminates the need for the system to precisely calculate depth. Instead, it enables the system to develop human-like spatial perception through extensive data learning. It understands that when a car's front is visible in the left camera and its rear in the rear camera, these should converge into the same physical entity's features on the BEV canvas.
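The querying mechanism can be sketched as plain scaled-dot-product cross-attention, where each BEV query pools features from all camera tokens. This is a deliberately simplified illustration: production BEV detectors typically use deformable attention, positional encodings, and learned projections, all omitted here, and every shape and value below is invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each BEV query softly selects
    which camera features describe its location on the canvas."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # rows sum to 1
    return weights @ values

bev_queries = rng.normal(size=(16, 32))   # a tiny 4x4 BEV canvas, dim 32
cam_feats = rng.normal(size=(48, 32))     # tokens from all surround cameras
bev_features = cross_attention(bev_queries, cam_feats, cam_feats)
```

The key point survives the simplification: no per-pixel depth is computed. Each canvas position learns, through the attention weights, which camera evidence belongs to it.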

The primary advantage of this approach is its ability to achieve feature-level fusion rather than result-level stitching. Previously, results from each camera were forcibly combined. Now, 360-degree information is integrated at the most fundamental feature stage. Thanks to the Transformer's global attention mechanism, it can even infer the situation in occluded areas using the overall road contour. For instance, when a truck obstructs the side view, the system can mentally reconstruct the road structure behind the truck in the BEV space by analyzing the trends of lane lines in front and behind. Achieving this level of logical coherence is challenging for traditional SLAM solutions.

How Does the Occupancy Network Address Perception Gaps?

BEV and Transformer models together solve visual-field reconstruction and spatial restoration, letting the vehicle perceive what the world looks like and how it is laid out. The Occupancy Network's significance lies elsewhere: it bypasses the traditional requirement that an object be classified before it can be represented. By simply determining whether space is occupied, it closes the perception gaps caused by the system's inability to name objects.

In SLAM semantic maps, if the system fails to identify an object, it may overlook its physical presence. In contrast, the Occupancy Network divides space into minuscule voxel blocks, with the sole objective of determining whether each block is occupied or empty.

This logic, based on geometric occupancy rather than semantic recognition, provides a physical safety net for intelligent driving systems. It perceives the world as a physical space filled with obstacles rather than a labeled classification table. Whether it's a fallen road sign, scattered cardboard boxes, or a crashed vehicle, the Occupancy Network can provide real-time feedback that the space is impassable. It doesn't need to know the object's name; it only needs to recognize that the physical space is occupied, guiding the vehicle to avoid it.
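The occupied-or-empty logic fits in a few lines. The snippet below is my own illustration, not a real occupancy network (which predicts occupancy from fused camera features rather than raw points): it simply marks any voxel containing a measurement as occupied, with no class label involved. The grid size and points are invented.

```python
import numpy as np

def voxelize(points, extent=10.0, voxel=0.5):
    """Mark each voxel occupied if any point falls inside it.
    No semantic label is needed — only geometry matters."""
    n = int(2 * extent / voxel)
    grid = np.zeros((n, n), dtype=bool)            # 2-D slice for brevity
    idx = np.floor((points + extent) / voxel).astype(int)
    inside = ((idx >= 0) & (idx < n)).all(axis=1)  # drop out-of-range points
    grid[idx[inside, 0], idx[inside, 1]] = True
    return grid

# Returns from some unclassifiable object — fallen boxes, debris, anything
points = np.array([[2.1, 0.3],
                   [2.2, 0.4],
                   [2.3, 0.2]])
grid = voxelize(points)                             # one voxel flagged occupied
```

Whether those points came from cardboard, concrete, or a crashed vehicle is irrelevant to the planner: the voxel is occupied, so the space is impassable.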

Simultaneously, this approach offers exceptional spatiotemporal continuity. By incorporating features processed by the Transformer into the Occupancy Network, the system can retain information from several previous frames, forming a 4D spatial perception with memory. Even if an obstacle is momentarily obscured by another vehicle, the system remembers that an object was detected in that voxel block and can predict its current position based on the object's motion trend. This continuous understanding of the physical world enables map-free intelligent driving solutions to navigate complex intersections and unexpected situations more calmly and safely than solutions relying on static semantic maps.
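One simple way to picture that memory, assuming a plain exponential-decay fusion (real systems learn their temporal fusion end-to-end, so this is only a mental model): an occupied voxel's confidence persists for a few frames even when the observation drops out. The decay constant and the frame sequence are invented.

```python
import numpy as np

def update_memory(memory, observation, decay=0.6):
    """Fuse the current occupancy observation with decayed memory,
    so a briefly occluded voxel is not forgotten instantly."""
    return np.maximum(observation.astype(float), memory * decay)

memory = np.zeros(5)                     # five voxels along one row
frames = [
    np.array([0, 0, 1, 0, 0]),           # object seen in voxel 2
    np.array([0, 0, 0, 0, 0]),           # occluded by a passing vehicle
    np.array([0, 0, 0, 0, 0]),           # still occluded
]
for obs in frames:
    memory = update_memory(memory, obs)
# memory[2] is 0.36 after two occluded frames: decayed, but remembered
```

Pairing this persistence with a motion model for the remembered object is what turns the 3-D grid into the 4-D perception described above.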

Why Has This Combination Become the Preferred Choice?

The integration of BEV, Transformer, and Occupancy Networks essentially unifies previously fragmented perception processes under a common coordinate system and mathematical framework. The primary reason SLAM solutions have not been widely adopted in intelligent driving is their attempt to establish a set of permanently fixed coordinates in a constantly changing environment. This approach is both costly and has low fault tolerance in complex urban settings.

Autonomous driving must embrace uncertainty. By leveraging the Transformer's robust capabilities to handle parallax and occlusion between cameras, utilizing the BEV perspective to provide a unified decision-making basis, and employing the Occupancy Network to compensate for shortcomings in recognizing unknown objects, autonomous driving can achieve the driving capabilities of an experienced driver. This architecture is not only more compatible with various sensor installation positions and models but also significantly simplifies the integration process between perception and downstream planning and control systems.

When the planning and control system receives not just fluctuating semantic labels and scattered point clouds but a high-definition, real-time 3D bird's-eye view containing physical occupancy information, path planning becomes as intuitive as navigating a racing game. This simplification and reconstruction from the underlying logic are the fundamental reasons why map-free intelligent driving can be rapidly implemented and demonstrate reaction capabilities surpassing those of human drivers. It's also why many automakers are boldly opting for map-free solutions.

-- END --
