DeepSeek Unveils New Open-Source Model DeepSeek-OCR: Text Compression Through Optics

11/14 2025

DeepSeek has just open-sourced a new 3-billion-parameter OCR model, which garnered over 100 downloads within a mere two hours of its release.

DeepSeek-OCR represents an initial foray into exploring the viability of employing optical two-dimensional mapping for compressing lengthy contexts. It comprises two key components: DeepEncoder and the decoder DeepSeek3B-MoE-A570M.

Experiments have demonstrated that when the quantity of text tokens is no more than 10 times the number of visual tokens, the model's decoding (OCR) accuracy can reach an impressive 97%. Even when subjected to a 20-fold compression ratio, the OCR accuracy still hovers around 60%.
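To make the compression ratio concrete, here is a minimal sketch of how the text-to-vision token ratio can be computed; the token counts in the example are hypothetical placeholders, not figures from the paper.

```python
# Illustrative only: the text-to-vision token compression ratio described above.
# The token counts below are hypothetical placeholders, not DeepSeek-OCR's pipeline.

def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Ratio of ground-truth text tokens to the vision tokens fed to the decoder."""
    return num_text_tokens / num_vision_tokens

# A page whose ground truth is ~1,000 text tokens, rendered into an image encoded
# as 100 vision tokens, gives a 10x ratio -- the regime where ~97% accuracy is reported.
print(compression_ratio(1000, 100))  # 10.0
```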

The main contributions of DeepSeek-OCR are threefold:

A thorough quantitative analysis of the visual-text token compression ratio was carried out. It achieved an OCR decoding accuracy exceeding 96% at a 9-10x text compression rate, approximately 90% accuracy at a 10-12x compression rate, and around 60% accuracy at a 20x compression rate.

A novel architecture, DeepEncoder, was introduced. It maintains low activation memory and utilizes a minimal number of visual tokens, even when dealing with high-resolution inputs.

Built upon DeepEncoder and DeepSeek3B-MoE, DeepSeek-OCR delivers top-tier end-to-end performance on OmniDocBench while employing the fewest visual tokens.

DeepSeek-OCR adopts a unified end-to-end VLM architecture, which includes an encoder and a decoder. The encoder, namely DeepEncoder, is responsible for extracting image features, tokenizing them, and compressing the visual representation.

DeepEncoder is mainly composed of two parts: a visual perception feature extraction component that heavily relies on a window attention mechanism, and a visual knowledge feature extraction component that employs a dense global attention mechanism.
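The sketch below illustrates that serial layout in PyTorch: a local-attention stage, a convolutional token compressor, and a global-attention stage. It is a minimal illustration only; the layer counts, dimensions, the 16x compression factor, and the use of plain full attention as a stand-in for window attention are assumptions, not the released implementation.

```python
# Illustrative sketch of DeepEncoder's serial layout; all sizes are placeholder assumptions.
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Downsamples the patch-token grid 4x per side, i.e. 16x fewer tokens (assumed factor)."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x, grid):
        b, n, c = x.shape                      # x: (batch, tokens, channels)
        x = x.transpose(1, 2).reshape(b, c, grid, grid)
        x = self.conv(x)                       # grid/4 per side after two stride-2 convs
        return x.flatten(2).transpose(1, 2)    # (batch, tokens/16, channels)

class DeepEncoderSketch(nn.Module):
    def __init__(self, dim=768, local_layers=2, global_layers=2):
        super().__init__()
        def stack(n):
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, n)
        self.local_attn = stack(local_layers)    # stand-in for the window-attention stage
        self.compress = TokenCompressor(dim)     # token reduction between the two stages
        self.global_attn = stack(global_layers)  # stand-in for the dense global-attention stage

    def forward(self, patch_tokens, grid):
        x = self.local_attn(patch_tokens)  # many tokens; the real model keeps this cheap via windows
        x = self.compress(x, grid)         # only the compressed tokens reach global attention
        return self.global_attn(x)

# 640x640 input with 16x16 patches -> 40x40 = 1600 patch tokens in,
# 1600 / 16 = 100 vision tokens out (the Small-mode count quoted below).
tokens = torch.randn(1, 1600, 768)
print(DeepEncoderSketch()(tokens, grid=40).shape)  # torch.Size([1, 100, 768])
```

The point of the serial design is that the token-heavy stage runs before compression with cheap local attention, so the expensive global-attention stage only ever sees the small, compressed token set.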

To assess the model's performance under varying compression rates (necessitating different numbers of visual tokens) and to enhance the practicality of DeepSeek-OCR, the research team has equipped it with multiple resolution modes.

The native resolution offers four sub-modes: Tiny, Small, Base, and Large, with corresponding resolutions and token counts of 512×512 (64), 640×640 (100), 1024×1024 (256), and 1280×1280 (400), respectively.
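The quoted token counts are consistent with 16x16-pixel patches followed by a 16x token compression in the encoder; the short snippet below merely reproduces the numbers under that assumption (it is not an official formula).

```python
# Reproduce the quoted vision-token counts, assuming 16x16 pixel patches
# followed by 16x token compression (an assumption consistent with the numbers above).
MODES = {"Tiny": 512, "Small": 640, "Base": 1024, "Large": 1280}

for name, side in MODES.items():
    patch_tokens = (side // 16) ** 2     # raw patch tokens before compression
    vision_tokens = patch_tokens // 16   # tokens after 16x compression
    print(f"{name}: {side}x{side} -> {vision_tokens} vision tokens")

# Tiny: 64, Small: 100, Base: 256, Large: 400
```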

The decoder utilizes DeepSeekMoE, which combines the expressive power of a 3B model with the inference efficiency of a compact model activating only around 570M parameters per token.
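The gap between the 3B total and the roughly 570M activated parameters is standard MoE arithmetic: only a few experts fire per token. The back-of-the-envelope sketch below uses hypothetical expert counts and sizes chosen purely for illustration, not DeepSeek's published configuration.

```python
# Back-of-the-envelope MoE arithmetic: total vs. activated parameters.
# All counts are hypothetical placeholders, not DeepSeek's actual configuration.
def moe_params(n_experts, top_k, expert_params, shared_params):
    total = n_experts * expert_params + shared_params      # every expert is stored
    activated = top_k * expert_params + shared_params      # only top-k experts run per token
    return total, activated

total, activated = moe_params(
    n_experts=64,         # routed experts (placeholder)
    top_k=6,              # experts activated per token (placeholder)
    expert_params=45e6,   # parameters per expert (placeholder)
    shared_params=300e6,  # attention, embeddings, shared layers (placeholder)
)
print(f"total ~{total/1e9:.1f}B, activated ~{activated/1e6:.0f}M per token")
# total ~3.2B, activated ~570M per token
```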

The researchers also compiled complex and diverse training data for DeepSeek-OCR, encompassing traditional OCR tasks such as scene image OCR and document OCR, as well as parsing tasks for intricate artificial images.

The training process is notably straightforward, primarily consisting of two stages:

Independently training DeepEncoder;

Training DeepSeek-OCR.

The results indicate that within a 10x compression ratio, the model's decoding accuracy can reach approximately 97%, a highly promising outcome.

Performance starts to decline once the compression ratio exceeds 10x. This can be attributed to two factors: first, the layout of long documents becomes increasingly complex; second, long text becomes blurred when rendered at resolutions of 512×512 or 640×640.
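A rough pixel-budget calculation illustrates the blurring effect; the character count below is a hypothetical dense page, not a figure from the paper.

```python
# Rough pixel-budget arithmetic (illustrative numbers only): why dense pages blur
# at low render resolutions.
chars_on_page = 4000  # hypothetical character count for a dense document page
for side in (512, 640, 1024, 1280):
    px_per_char = side * side / chars_on_page
    print(f"{side}x{side}: ~{px_per_char:.0f} px/char (~{px_per_char ** 0.5:.1f} px per side)")
# At 512x512, each character gets roughly an 8x8-pixel box, which is at the
# edge of legibility for small fonts or CJK glyphs.
```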

To quantify OCR performance, the research team evaluated DeepSeek-OCR on OmniDocBench. The results showed that DeepSeek-OCR, using only 100 visual tokens (640×640 resolution), outperformed GOT-OCR2.0, which uses 256 tokens per page. With 400 tokens (285 effective tokens, 1280×1280 resolution), it achieved performance on par with the state of the art on this benchmark.

These findings demonstrate that DeepSeek-OCR is highly capable in practical applications, and its superior token compression rate leaves ample headroom for further research.

In the realm of financial research reports, DeepSeek-OCR's deep parsing mode can extract structured results for charts within documents.

For books and articles, the deep parsing mode can generate dense captions for natural images contained in documents. Simply by inputting prompts, the model can automatically identify image types and produce the desired results.

The researchers believe that DeepSeek-OCR will make significant contributions to the future development of VLMs and LLMs. Moreover, the model is highly practical: it can support large-scale pre-training data production and serve as a valuable assistant for LLMs.

Of course, OCR alone is not sufficient to fully validate true context optical compression. DeepSeek plans to conduct further evaluations in the future, such as digital-optical text interleaved pre-training.

References:

https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
