AI News

Published on: Oct 20, 2025, 11:17 PM · aurorasculpt

DeepSeek Launches 3B Model for Advanced OCR and Document Conversion

DeepSeek-AI has unveiled DeepSeek-OCR, a 3B-parameter, end-to-end OCR and document-parsing Vision-Language Model (VLM). The system compresses lengthy text into a small set of vision tokens, which a language model then decodes back into text. The research team reports roughly 97% decoding precision on the Fox benchmark when the text-token count is within 10 times the vision-token count, and still-usable performance even at 20 times compression.
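Since every headline number here is a compression ratio, a quick bit of arithmetic (illustrative only, not part of the model's API) pins down the terms:

```python
# Illustrative arithmetic: the optical compression ratio is simply
# the number of text tokens divided by the number of vision tokens.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# A 1,000-token page rendered into 100 vision tokens sits exactly at
# the 10x boundary associated with ~97% decoding precision.
print(compression_ratio(1000, 100))  # 10.0
print(compression_ratio(2000, 100))  # 20.0 -- precision drops off here
```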

DeepSeek-OCR comprises two components: a vision encoder called DeepEncoder and a Mixture-of-Experts (MoE) decoder named DeepSeek3B-MoE-A570M. The encoder is optimized for high-resolution inputs and chains a SAM-based window-attention stage with a CLIP-based global-attention stage. The decoder is a 3B-parameter MoE model with approximately 570M parameters active per token.
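A minimal PyTorch sketch of that token flow, with placeholder dimensions, standard transformer layers standing in for the SAM and CLIP stages, and an illustrative 16× convolutional downsampler between them (none of these hyperparameters are the released model's):

```python
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    """Toy two-stage encoder: a local-attention stand-in, a convolutional
    token compressor, then a global-attention stand-in."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.window_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # Conv downsampler standing in for the compressor that shrinks
        # the token count before the (expensive) global-attention stage.
        self.compress = nn.Conv1d(dim, dim, kernel_size=16, stride=16)
        self.global_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        x = self.window_stage(patch_tokens)                   # (B, N, D)
        x = self.compress(x.transpose(1, 2)).transpose(1, 2)  # (B, N/16, D)
        return self.global_stage(x)                           # compact vision tokens

encoder = DeepEncoderSketch()
vision_tokens = encoder(torch.randn(1, 1600, 256))  # 1,600 patch tokens in
print(vision_tokens.shape)                          # torch.Size([1, 100, 256])
```

The ordering is the point: window attention handles the many raw patch tokens cheaply, and global attention only ever sees the compressed sequence.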

DeepEncoder supports native and dynamic modes, each pairing an input resolution with a fixed vision-token budget. Dynamic modes combine a global page view with tiled local views, letting AI developers align token budgets with page complexity, whether that means dense fonts or unusually text-heavy layouts; a mode-selection sketch follows below.
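As a concrete illustration, the resolution and token figures reported for the native modes can be encoded in a small lookup table. `pick_mode` is a hypothetical helper, not part of the release, and the figures should be verified against the model card:

```python
# Native-mode presets as reported for DeepEncoder (verify against the
# model card); the dynamic "Gundam" mode instead tiles local views
# around a single global view.
MODES = {
    "tiny":  {"resolution": (512, 512),   "vision_tokens": 64},
    "small": {"resolution": (640, 640),   "vision_tokens": 100},
    "base":  {"resolution": (1024, 1024), "vision_tokens": 256},
    "large": {"resolution": (1280, 1280), "vision_tokens": 400},
}

def pick_mode(expected_text_tokens: int, max_ratio: float = 10.0) -> str:
    """Choose the cheapest preset that keeps compression under max_ratio."""
    for name, cfg in MODES.items():  # dicts preserve insertion order
        if expected_text_tokens / cfg["vision_tokens"] <= max_ratio:
            return name
    return "large"  # fall back to the largest native budget

print(pick_mode(650))   # "small": 650 / 100 = 6.5x
print(pick_mode(3000))  # "large": 3000 / 400 = 7.5x
```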

In the Fox benchmark study, 100 vision tokens achieved 98.5% precision at 6.7× compression on pages containing 600 to 700 text tokens (roughly 670 text tokens decoded from 100 vision tokens). On OmniDocBench, DeepSeek-OCR outperformed GOT-OCR 2.0 while spending only 100 vision tokens per page.

The research team describes a two-phase training pipeline: DeepEncoder is first trained on OCR data, and the full system is then trained with pipeline parallelism. Training throughput reaches 90B tokens per day on text-only data. In production, the system can process more than 200k pages per day on a single A100 40G node.
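A back-of-the-envelope conversion (plain arithmetic, not a reported figure) puts those daily rates in per-second terms:

```python
# Convert the reported daily throughput figures to per-second rates.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

print(200_000 / SECONDS_PER_DAY)  # ~2.3 pages/s on one A100-40G in production
print(90e9 / SECONDS_PER_DAY)     # ~1.04M text tokens/s during training
```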

DeepSeek-OCR is a practical advancement in document AI: it treats pages as compact optical carriers of text, which shrinks the decoder's sequence length. The released code targets PyTorch 2.6.0, CUDA 11.8, and Flash Attention 2.7.3, keeping setup straightforward for engineers; a loading sketch under those assumptions follows.
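For orientation, here is a minimal loading sketch. It assumes the Hugging Face model ID deepseek-ai/DeepSeek-OCR and the custom `infer` helper that the repository's remote code is described as exposing; the prompt format and argument names may differ in practice, so consult the model card before use.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed model ID; trust_remote_code is required because the OCR
# logic ships as custom code alongside the weights.
model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # pairs with Flash Attention 2.7.3
)
model = model.eval().cuda().to(torch.bfloat16)

# `infer` and its arguments are assumptions based on the model card.
result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="page.png",  # hypothetical input path
    output_path="out/",     # hypothetical output directory
    base_size=1024,         # global-view resolution
    image_size=640,         # local-tile resolution
    crop_mode=True,         # enable the dynamic (tiled) mode
)
```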