Recently, DeepSeek open-sourced its latest model, DeepSeek-OCR. As the accompanying research paper explains, DeepSeek-OCR is a preliminary exploration of the feasibility of compressing long contexts through optical 2D mapping. The model consists of two components: DeepEncoder as the encoder and DeepSeek3B-MoE-A570M as the decoder. As the core engine, DeepEncoder is designed to keep activation memory low even under high-resolution inputs, achieving a high compression ratio while keeping the number of visual tokens manageable.
In simpler terms, this is a visual-text compression paradigm: content that would typically require a large number of text tokens is instead represented by a much smaller number of visual tokens, reducing the computational demands of large models.
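To make the idea concrete, here is a minimal sketch of the pipeline in Python: a page of text is rasterized into an image, and a ViT-style encoder would then see that page as a small grid of patches, one visual token per patch. The function names, patch size, and the 4-characters-per-token heuristic are illustrative assumptions (using the Pillow library for rendering), not the DeepSeek-OCR API.

```python
# Illustrative sketch only: hypothetical names, not the DeepSeek-OCR API.
from PIL import Image, ImageDraw

def render_to_image(text: str, width: int = 1024, height: int = 1024) -> Image.Image:
    """Rasterize plain text onto a fixed-size page image."""
    page = Image.new("L", (width, height), color=255)            # blank white page
    ImageDraw.Draw(page).multiline_text((16, 16), text, fill=0)  # draw black text
    return page

def visual_token_count(img: Image.Image, patch: int = 64) -> int:
    """One visual token per patch-sized tile, as a ViT-style encoder would see it."""
    w, h = img.size
    return (w // patch) * (h // patch)

document = "the quick brown fox jumps over the lazy dog " * 250  # stand-in long text
text_tokens = len(document) // 4   # rough heuristic: ~4 characters per text token
page = render_to_image(document)
vision_tokens = visual_token_count(page)  # 16 x 16 = 256 tokens for the whole page
print(f"text tokens ~ {text_tokens}, visual tokens = {vision_tokens}, "
      f"ratio ~ {text_tokens / vision_tokens:.1f}x")
```

On these toy numbers, one 1024x1024 page stands in for roughly ten times its cost in text tokens, which is the regime where the paper reports near-lossless decoding.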
01. Endowing AI with 'Eyes' and Teaching It to 'Forget'
This innovation not only addresses the technical challenge of processing long texts but also prompts a reevaluation of how large models perceive the world. Traditionally, large models have interpreted the world through abstract textual symbols, i.e., text tokens. The groundbreaking aspect of DeepSeek-OCR lies in equipping models, for the first time, with genuine 'visual perception' capabilities. By converting text into images and then compressing them, the model bypasses the abstract filtering layer of language and extracts features directly from richer visual information, much as humans perceive the world through their eyes rather than relying solely on secondhand descriptions.
Previous large models, including ChatGPT, Gemini, Llama, Qwen, and even earlier iterations of DeepSeek, all processed data uniformly: through text, or what is commonly termed tokens.
However, contemporary large models generally struggle to process extremely long texts efficiently. Most mainstream models have context windows of 128K to 200K tokens, yet financial reports, scientific papers, and books can easily span thousands of pages, interspersed with tables, formulas, and more. Traditional methods can only divide the text into segments and process them sequentially, which breaks logical continuity and adds latency. DeepSeek-OCR takes an unconventional route: it converts text into images and compresses them, decompressing back into text only when textual output is required. This reduces token consumption by an order of magnitude while preserving high accuracy.
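A back-of-the-envelope calculation shows why this matters for long documents. The figures below (tokens per page, compression ratio) are rough assumptions for illustration, not measurements from the paper:

```python
# How many document pages fit in one context window, with and without
# optical compression. All constants are illustrative assumptions.
CONTEXT_WINDOW = 128_000      # tokens, typical of mainstream models
TEXT_TOKENS_PER_PAGE = 600    # a dense page of prose, rough estimate
COMPRESSION = 10              # the ~10x regime where DeepSeek reports ~97% accuracy

pages_as_text = CONTEXT_WINDOW // TEXT_TOKENS_PER_PAGE
pages_as_pixels = CONTEXT_WINDOW // (TEXT_TOKENS_PER_PAGE // COMPRESSION)

print(f"pages that fit as text tokens:   {pages_as_text}")    # ~213 pages
print(f"pages that fit as visual tokens: {pages_as_pixels}")  # ~2,133 pages
```

Under these assumptions, the same window that holds a long report as raw text can hold a book-length document as compressed pages, without the segment-and-stitch workaround described above.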
To achieve this, DeepSeek-OCR introduces the concept of 'Context Optical Compression' (COC), which enables efficient information compression through text-to-image conversion.
The feasibility of this method has been empirically validated. At a 10x compression ratio, DeepSeek-OCR achieves a decoding accuracy of 97%, representing near-lossless compression. Even at a 20x compression ratio, accuracy remains around 60%.
In their paper, the DeepSeek team also suggests using optical compression to simulate human forgetting mechanisms.
For instance, recent memories resemble nearby objects—sharp and clear. Thus, they can be rendered into high-resolution images, utilizing more visual tokens to retain high-fidelity information.
Conversely, distant memories are akin to faraway objects—gradually fading. Therefore, they can be progressively scaled down into smaller, blurrier images, represented with fewer visual tokens, achieving a natural form of information forgetting and compression.
In this manner, theoretically, the model can dynamically allocate varying amounts of computational resources to contexts from different time periods when processing extremely long conversations or documents, potentially enabling the construction of an infinite-context architecture.
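This mechanism can be sketched as a tiered token budget keyed to recency. The tier boundaries and per-tier budgets below are invented for illustration; the paper proposes the principle, not these numbers:

```python
# A sketch of the "optical forgetting" idea: older context is re-rendered at
# progressively lower resolution, so it costs progressively fewer visual
# tokens. Tiers and budgets here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ContextPage:
    text: str
    age: int  # how many turns/pages ago this content appeared

def tokens_for_age(age: int) -> int:
    """Fewer visual tokens for older pages: sharp nearby, blurry far away."""
    if age < 10:
        return 256  # recent: full-resolution rendering
    if age < 100:
        return 64   # mid-term: downscaled, some detail lost
    return 16       # distant: heavily blurred, gist only

history = [ContextPage(f"page {i}", age=i) for i in range(200)]
tiered = sum(tokens_for_age(p.age) for p in history)
flat = 256 * len(history)  # everything kept at full resolution
print(f"tiered budget: {tiered} tokens vs flat budget: {flat} tokens")
# -> 9,920 vs 51,200: roughly a 5x saving on this toy history
```

The design choice mirrors the paper's analogy directly: recency, not importance, decides how many tokens a memory is worth, so the budget decays automatically as a conversation grows.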
The team acknowledges that while this remains an early research direction, it offers a novel approach for models to handle extremely long contexts.
Such innovations have sparked reflection within the AI community about the visual approach. Andrej Karpathy, a founding member of OpenAI and former Director of AI at Tesla, where he led the Autopilot vision effort, remarked that while DeepSeek-OCR is an excellent OCR model, the paradigm shift it may signal is even more noteworthy.
Karpathy raises a bold question: for large language models, are pixels a better input than text?
02. The Merits of Pixels Over Text and Current Challenges
First, from an information-density perspective, pixels, as raw visual signals, carry far more information than highly abstracted and compressed text. A textual description of 'a golden wheat field at sunset' conveys the concept but loses the gradient of light and shadow, the texture of the wheat ears, and the spatial hierarchy; these nuances are embedded precisely in the pixel matrix. By processing pixels directly, large models bypass the 'filtering layer' of human language and learn the world's complex patterns from more fundamental, richer sensory data.
The DeepSeek-OCR team contends, 'An image containing document text can represent rich information with far fewer tokens than its equivalent digital text. This suggests that optical compression through visual tokens can achieve a higher compression ratio.'
Second, pixels possess cross-cultural universality. Text is bound to specific linguistic systems and cultural backgrounds, which creates barriers to understanding. The pixels in an image or video, by contrast, reflect universal physical laws (such as gravity and light), laying the groundwork for models to build a more unified and fundamental world model. A model can grasp a physical phenomenon like 'a sphere rolling' without first mastering the grammar of English or Chinese.
Finally, this pixel-based learning pathway more closely mirrors the human cognitive process of 'seeing is believing.' It compels models to actively abstract objects, attributes, and relationships from chaotic sensory inputs, potentially giving rise to more robust and generalizable intelligence. When models can understand and generate coherent pixel sequences (such as videos), they acquire the ability to simulate and create visual worlds, a significant stride toward artificial general intelligence. Thus, as information carriers, pixels provide large models with learning materials that are closer to reality and more authentic.
So, is DeepSeek-OCR flawless? Not quite. The paper openly acknowledges its limitations.
For instance, ultra-high compression ratios carry risks: when the compression ratio exceeds 30x, the retention rate of critical information drops below 45%, making the approach unsuitable for precision-critical scenarios such as law and medicine. Complex graphic recognition also remains weak, with accuracy on 3D charts and handwritten or artistic text running 12-18 percentage points below that on printed text.
03. Conclusion: DeepSeek-OCR Offers a Novel Approach
The introduction of DeepSeek-OCR marks a new exploratory phase in AI development. It signifies not merely an upgrade in technical tools but a reconstruction of cognitive frameworks—when large models begin to comprehend the world through pixels rather than pure text, we are witnessing a paradigm shift from 'symbol processing' to 'perceptual understanding.' The significance of this transformation extends far beyond resolving the specific challenge of long-text processing; it hints at a future where AI may establish a cognitive system closer to human sensory experience, directly constructing an understanding of the world from multimodal raw data.
However, as the research team cautions, this remains an early research direction. Technical breakthroughs often bring new challenges: How can we strike the optimal balance between compression efficiency and information fidelity? How can models 'learn to forget' without sacrificing critical information? These questions necessitate joint exploration from academia and industry. More importantly, how will this shift in technical pathways reshape human-computer interaction and give rise to entirely new application scenarios? These are all questions deserving of continued attention.
From a broader perspective, the visual approach embodied by DeepSeek-OCR is not a replacement for the current mainstream text-based approach but rather two complementary and symbiotic cognitive dimensions. Future artificial general intelligence may need to integrate the abstract reasoning of text with the concrete perception of vision to construct a truly comprehensive and robust intelligence system. This exploratory journey has just commenced, but the future it heralds is already exhilarating enough.
- The End -