03/03 2026
In the realm of autonomous driving technology, the pure vision approach has gained popularity among numerous automakers due to its emulation of human driving logic and relatively low hardware costs. However, this method, which heavily depends on cameras, experiences a marked decline in performance under conditions such as nighttime, dimly lit tunnels, intense backlight, or adverse weather like heavy rain, snow, or fog. Why does lighting wield such a profound influence on pure vision-based autonomous driving?
The Physical Limits of Passive Perception
The pure vision perception system is fundamentally a passive measurement system reliant on ambient light reflection. The defining characteristic of this system is that the camera does not emit energy; all information it gathers comes from external light sources, such as sunlight, streetlights, or the headlights of other vehicles, reflecting off object surfaces and returning as photons.
This operational mode is akin to that of the human eye. When ambient light is abundant and evenly distributed, the camera can capture rich color, texture, and semantic information, crucial for identifying traffic signs, discerning road markings, and interpreting complex traffic intentions. However, once the light source is absent or the lighting environment becomes extreme, the drawbacks of passive perception become evident.
In contrast, LiDAR and other active sensors function like "vision with a built-in flashlight." LiDAR actively emits controlled laser pulses and receives the reflected energy from targets, using the time-of-flight principle to directly calculate the spatial coordinates of objects. This active detection mechanism enables LiDAR to maintain high perception accuracy even in complete darkness, with minimal interference from ambient light.
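The time-of-flight ranging described above reduces to a one-line calculation. A minimal sketch, where the 200 ns round-trip time is purely illustrative:

```python
# Sketch of the time-of-flight principle LiDAR uses to range targets.
# The round-trip time below is an illustrative value, not sensor data.

C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds: float) -> float:
    """Distance to target: the pulse travels out and back, so halve the path."""
    return C * round_trip_seconds / 2.0

# A pulse returning after ~200 ns corresponds to a target ~30 m away.
print(tof_distance(200e-9))  # ≈ 29.98 m
```

Because the measurement depends only on the emitted pulse's travel time, ambient brightness does not enter the equation at all, which is exactly why the active approach survives darkness.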
In low-light environments, the primary challenge for camera sensors is a sharp decline in the signal-to-noise ratio (SNR). When photons are scarce, the effective signals captured by the sensor may be overwhelmed by thermal noise generated by the circuitry. To "see" objects in the dark, the system must extend exposure time or increase sensitivity (ISO).
Extending exposure time is particularly hazardous in dynamic driving scenarios because the relative motion between the vehicle and the target can cause severe motion blur in the image, rendering once-clear outlines of targets as indistinct shadows.
On the other hand, blindly increasing sensitivity introduces a significant amount of random noise, filling the image with impurities and severely interfering with the backend neural network's ability to extract object features. This "raw material," already compromised at the physical level, ensures that the pure vision approach struggles in low-light conditions.
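The SNR collapse can be made concrete with a simple shot-noise model. This is a sketch under stated assumptions: the photon counts and the 5-electron read noise are illustrative, not figures for any particular automotive sensor:

```python
import math

def snr(photons: float, read_noise_e: float = 5.0) -> float:
    """SNR for one pixel: signal / sqrt(shot noise^2 + read noise^2).
    Photon arrival is Poisson-distributed, so shot-noise variance equals
    the signal itself; read noise (assumed 5 e-) is fixed circuit noise."""
    return photons / math.sqrt(photons + read_noise_e ** 2)

# Hypothetical photon counts per pixel per exposure:
print(snr(10_000))  # bright daylight: SNR ≈ 99.9
print(snr(50))      # dim street at night: SNR ≈ 5.8
```

At 10,000 photons the read noise is negligible; at 50 photons it rivals the shot noise, and the SNR drops by more than an order of magnitude, which is the physical basis of the "raw material" problem described above.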
Interception and Distortion of Light Waves by Environmental Media
Autonomous vehicles do not operate in a vacuum; light must traverse a complex atmospheric environment from object surfaces back to the camera. Adverse weather conditions such as rain, snow, and fog alter the propagation path of light waves, imposing multiple barriers on visual perception through physical phenomena like scattering, refraction, and absorption.
The impact of fog on vision primarily arises from Mie scattering. The diameter of fog droplets is typically comparable to the wavelength of visible light. When light waves encounter these tiny water droplets, they scatter intensely in all directions.
This scattering effect has two severe consequences: first, the intensity of light rapidly attenuates during propagation, causing distant objects to disappear from the image; second, background and ambient light are scattered into a white "curtain," significantly reducing target contrast.
From a signal processing perspective, fog is equivalent to superimposing a large-scale low-pass filter on the image, filtering out most high-frequency details. Neural networks struggle to identify pedestrian edges or lane markings obscured by fog when processing such images, leading to a significant drop in recognition confidence or even complete missed detections.
Rainy scenarios present another issue. Falling raindrops have extremely high transparency and unique geometric shapes, acting like tiny spherical lenses that refract and totally reflect light passing through them. This causes local distortions and artifacts in the images captured by the camera.
A more severe problem occurs on the protective glass covering the camera, where adhering raindrops blur large areas of the image. Because these droplets sit far inside the lens's focus distance, they appear as heavily defocused blobs, rendering critical regions of the image unusable.
In snowy environments, the visual system faces the dual challenges of contrast loss and physical obstruction. Snowflakes have extremely high light reflectivity, leading to large areas of overexposure in images under strong light; on cloudy days, the lack of sufficient contrast between white snow, white vehicles, and white road signs makes it difficult for perception algorithms to distinguish targets from the background. Additionally, sticky snow may directly cover the camera lens, a form of physical "blinding" that no software algorithm can recover from.
These physical-level interferences directly challenge the pure vision system's ability to model spatial geometric structures. Since cameras cannot strip away environmental noise through precise pulse return times like LiDAR, they must rely on probabilistic predictions to guess the existence of objects amid chaotic pixels. In such cases, the interception of light by physical laws effectively cuts off the information source on which the visual system relies.
The Overlooked Information Loss in Image Signal Processors
Even if light successfully penetrates the atmosphere and is captured by the camera sensor, there is still a complex step between the raw electrical signals (RAW data) output by the photosensitive units and the final colored images (RGB images) that enter the autonomous driving "brain": the image signal processor (ISP).
For a long time, the tuning goal of in-vehicle ISPs has been to serve "human viewing," pursuing visual effects with vivid colors, high contrast, and minimal noise. However, this pursuit of "aesthetic" processing is detrimental to machine vision algorithms.
The ISP processing pipeline includes multiple stages such as demosaicing, white balance correction, denoising, gamma correction, and tone mapping. In low-light or high dynamic range (HDR) scenarios, the side effects of ISP are particularly evident. To suppress noise in low-light conditions, ISPs employ powerful spatial or frequency domain denoising algorithms. While these algorithms remove random noise, they also indiscriminately erase fine texture details, resulting in an "oil painting" effect in the image.
For human drivers, this smoothing may enhance visual comfort, but for deep learning models that rely on pixel-level feature gradients for object detection, it means losing critical high-frequency information for judging object edges.
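This trade-off can be illustrated with a toy spatial denoiser. A moving-average filter stands in for the far more sophisticated filters in a real ISP; the signal values are invented for illustration:

```python
def box_denoise(signal, radius=2):
    """Toy spatial denoiser: replace each sample with its local mean."""
    out = []
    for i in range(len(signal)):
        window = signal[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

edge = [10] * 5 + [200] * 5            # a sharp object boundary in a 1-D scanline
smoothed = box_denoise(edge)
grad = lambda s: max(abs(a - b) for a, b in zip(s, s[1:]))
print(grad(edge), grad(smoothed))      # 190 vs 38.0: the edge gradient collapses fivefold
```

The filter would indeed suppress random noise, but the step response shows the cost: the very pixel gradient an edge detector or CNN activation depends on is flattened along with the noise.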
Another issue lies in dynamic range processing. The brightness span in natural scenes can exceed 140 dB, while the dynamic range of mainstream in-vehicle camera sensors is generally around 120 dB. When a vehicle exits a dark tunnel into blinding sunlight, the ISP must adjust exposure parameters within a fraction of a second.
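These decibel figures translate into linear brightness ratios via the standard 20·log10 convention, which makes the size of the gap explicit:

```python
import math

def dynamic_range_db(ratio: float) -> float:
    """Dynamic range in dB for a brightest-to-darkest ratio."""
    return 20 * math.log10(ratio)

def ratio_from_db(db: float) -> float:
    """Inverse: linear ratio corresponding to a dB figure."""
    return 10 ** (db / 20)

print(ratio_from_db(120))  # 1,000,000 : 1  (typical sensor)
print(ratio_from_db(140))  # 10,000,000 : 1 (natural scene)
```

A 20 dB shortfall means the scene spans ten times the brightness ratio the sensor can capture in one exposure, so something must be clipped or compressed.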
Traditional HDR technology achieves high dynamic display through multi-frame exposure synthesis, but this introduces severe motion artifacts at high speeds. Due to the time difference between different exposure frames, fast-moving objects appear as double images or ghostly shadows in the synthesized image, preventing the autonomous driving algorithm from accurately judging object boundaries.
Furthermore, the tone mapping and gamma correction performed by the ISP are essentially nonlinear information compression processes. To map the 20-bit or 24-bit high dynamic RAW data captured by the sensor to an 8-bit or 10-bit RGB space, the ISP forcibly compresses the contrast in shadow and highlight regions.
In this process, subtle brightness differences that were clearly distinguishable in the RAW domain are forcibly merged into the same pixel value. This mathematically irreversible loss deprives the perception network of any chance of discerning fine detail in extreme light-and-shadow scenarios.
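The merging is easy to demonstrate with a simplified tone map. This is a sketch, not a real ISP pipeline: a single global gamma curve from a 20-bit RAW domain to 8-bit output, with invented shadow values:

```python
def raw_to_8bit(raw_value: int, raw_bits: int = 20, gamma: float = 2.2) -> int:
    """Simplified global tone map: normalize the RAW code, apply a 1/gamma
    power curve, and quantize to 8 bits. Real ISPs use more elaborate
    local tone mapping, but the quantization loss is the same in kind."""
    normalized = raw_value / (2 ** raw_bits - 1)
    return round((normalized ** (1 / gamma)) * 255)

# Two deep-shadow details that are clearly distinct in the RAW domain...
a, b = 120, 135
print(raw_to_8bit(a), raw_to_8bit(b))  # both collapse to the same 8-bit code
```

With roughly a million RAW codes mapped onto 256 output levels, thousands of distinct input values necessarily share each output value; once merged, no downstream network can tell them apart.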
This misalignment between "human-eye-oriented" and "machine-oriented" processing is a significant contributing factor to the poor performance of pure vision approaches in extreme scenarios. Currently, some technical solutions are attempting to bypass traditional ISPs and directly use RAW domain data for end-to-end object detection training to preserve all the original information from the photosensor, indirectly proving the limitations of traditional processing pipelines in light and shadow challenges.
The Cognitive Boundaries of Deep Learning in Extreme Scenarios
Pure vision autonomous driving relies on deep learning algorithms; however, the performance of object detection models based on convolutional neural networks (CNNs) or Transformers is highly dependent on the distribution of training data. When facing significantly deteriorated lighting conditions, "cognition" at the algorithmic level also exhibits severe biases.
The basis for neural networks to extract object features lies in the contrast gradients between pixels. In cases of intense backlight or direct nighttime high-beam illumination, light produces severe "glare" and "blooming" effects. When an extremely bright point light source (such as the high beams of an oncoming vehicle) shines on the sensor, the generated charge overflows into adjacent pixels, causing large areas of bright spots in the image.
This phenomenon not only obscures the texture of the obstacle itself but also completely destroys its geometric outline. When high-frequency components in the feature map disappear due to overexposure or extremely low brightness, the convolutional kernels fail to capture effective activation signals, causing the system to logically "ignore" the existence of the obstacle.
Additionally, monocular pure vision systems can only infer depth algorithmically. The model estimates distance by identifying object types and applying priors such as "closer objects appear larger," or by reading changes in road texture. On poorly lit nights, however, road textures are almost invisible and object features are distorted by noise.
In such cases, the algorithm's depth estimates become highly unstable. Even if the system identifies a pedestrian ahead, it may misjudge the distance, triggering emergency braking too late or not at all. At highway speeds, a deviation of a few meters can be the difference between a near miss and a collision.
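How a small pixel error becomes a large distance error follows directly from the pinhole camera model. A minimal sketch, where the focal length and pedestrian height are assumed illustrative values:

```python
def monocular_distance(focal_px: float, real_height_m: float, image_height_px: float) -> float:
    """Pinhole model: distance = focal length (in pixels) * real-world height
    / apparent height in the image (in pixels)."""
    return focal_px * real_height_m / image_height_px

f = 1200.0  # hypothetical focal length in pixels
H = 1.7     # assumed pedestrian height in metres

print(monocular_distance(f, H, 40.0))  # 51.0 m
print(monocular_distance(f, H, 36.0))  # ≈ 56.7 m
```

If noise or blur shrinks a 40-pixel pedestrian to 36 pixels, the estimate shifts by nearly 6 m; at highway speed that is several tenths of a second of reaction time lost, which is the instability the paragraph above describes.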
There is also a deeper issue: current pure vision models are essentially performing a form of "pattern matching." When 99% of the scenes in the training dataset are sunny, well-lit highways, the model develops a prior bias.
When it encounters bizarre contours produced by dramatic light and shadow alternations at a nighttime tunnel entrance, the model may incorrectly classify them as non-threatening shadows or road debris. This lack of generalization ability for long-tail scenarios (Edge Cases) is a gap that the pure vision approach must overcome to achieve L4 or higher levels of autonomous driving.
Final Thoughts
From the low signal-to-noise ratio of passive perception to the interception of photons by atmospheric media, from data cropping during ISP processing to the cognitive helplessness of neural networks in the face of feature loss, each link accumulates perception errors in pure vision autonomous driving. Although the boundaries of the pure vision approach are continuously expanding with the application of sensors with larger dynamic ranges, the introduction of end-to-end RAW domain perception, and the supplementation of cross-modal training data, its "light and shadow dead zones," determined by physical properties, remain a core issue that the industry must carefully address when balancing safety and cost.