DeepSeek introduces visual primitives to enhance AI's spatial reasoning.

Article | LetterAI

The day before the May Day holiday, DeepSeek unexpectedly released a report on visual multimodal technology.

Before I opened it, I had a rough idea of what to expect: essentially, how far the model could see and how clearly.

After all, over the past year, multimodal models have largely been competing in this direction. OpenAI has talked about thinking with images, enabling models to crop, zoom, and rotate images during reasoning; Gemini and Claude are also working on enabling models to handle higher-resolution and more complex visual inputs.

The common assumption is that if the model looks more closely, visual reasoning will naturally become stronger.

But after reviewing DeepSeek’s report, you’ll find they’ve taken a completely different path.

DeepSeek did not focus on "showing the model more pixels"; instead, they focused on a more fundamental issue.

Even if the model has seen the image clearly, how can you be sure it is referring to the same thing you are while it reasons?

Actually, this is the most overlooked flaw in multimodal reasoning.

When humans look at images, they can use their fingers to point out objects, like “This person is so-and-so” or “That person is so-and-so.” But how would the model know which one you’re referring to?

The model can only use language to say "the one on the left," "the one on top," or "this line." Once the image becomes complex, linguistic references become ambiguous, and reasoning follows suit and collapses.

So DeepSeek said, why not just give the model a "finger"?

It transforms dots and bounding boxes into the fundamental units of the model’s reasoning, enabling the model to point at objects with this cyber finger while reasoning.

01 From Continuous Vision to Discrete Symbols

In this technical report, DeepSeek raises an interesting question: they believe the real challenge for multimodal models is not seeing images, but consistently referring to the same visual object throughout continuous reasoning.

For example, you tell your friend, "The vegetables at Old Lady Zhang’s stall at the market are the freshest." But there are so many elderly men and women at the market—which one is Old Lady Zhang?

But if you point directly and say, “That one,” your friend will immediately understand.

DeepSeek refers to this issue as the "Reference Gap."

Over the past year, nearly all cutting-edge multimodal models have been addressing the "perception gap."

Imagine a photo placed in front of you—if it’s too blurry or has low resolution, you might not be able to read the small text or see details in the distance. The same goes for AI: if the input image is of poor quality or processed incorrectly, the AI will “fail to see” clearly—that’s the perception gap.

Models like GPT, Claude, and Gemini are continuously improving resolution by introducing high-resolution cropping, dynamic tiling, and multi-scale processing, all aimed at enabling the model to see more details.

This direction is certainly valuable, but DeepSeek points out in the report that even if the model sees clearly, it still experiences logical breakdowns in complex spatial reasoning tasks.

The issue lies in natural language itself.

Suppose a photo contains dozens of dogs. When you say "the dog on the left," the model cannot tell which specific dog you mean.

Even more challenging, if you ask the model to count the number of dogs in a photo, it often loses track of which ones it has already counted and which ones remain.

The report also mentioned extreme cases such as maze navigation, where pure language cannot accurately describe irregular paths and complex topological relationships.

Language, as a referential tool, is inherently ambiguous within a continuous visual space. It excels at abstract concepts and causal relationships, but fundamentally lacks the capacity to express spatial positioning and topological relationships.

DeepSeek's model is itself built on a general-purpose language model, so how does it get around this limitation?

This is the "finger" mentioned at the beginning of the article.

The core concept they propose is "visual primitives," specifically elevating bounding boxes and points—the most fundamental spatial markers in computer vision—to the status of "minimum units of thought."

Previous multimodal models could also draw bounding boxes around objects, but they only showed you the final result to prove “I found it.” It’s like taking an exam and submitting only the answer without showing your work.

Some studies have also had AI draw boxes during its reasoning process, but the purpose is only to “see more accurately”—the boxes are merely an auxiliary tool. It’s like using scratch paper when solving math problems: the paper helps you calculate more clearly, but it’s not part of the solution itself.

DeepSeek is doing something completely different.

They embed these spatial markers directly into the model's reasoning process, making them an organic part of the inference. When thinking, the model doesn't just describe in language, "I see a dog," but also outputs, "I see a dog, and it is here: [[x1,y1,x2,y2]]".

DeepSeek calls this mechanism "point while it reasons."


Each step of the model's reasoning is anchored to specific coordinates in the image.

The technical report gives a maze example: the model starts from the origin, explores, backtracks, and tries again, ultimately outputting a complete sequence of coordinates, each corresponding to a point visited in the maze.

This way, the model won’t get lost during inference. It won’t be confused about what it’s saying or referring to. Each visual object has a clear spatial anchor, making the reasoning process traceable and verifiable.
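Because the anchors are written directly into the reasoning text, they can be pulled back out and checked. The following is a minimal sketch of that idea, assuming the `[[x1,y1,x2,y2]]` notation quoted above with integer pixel coordinates. The regular expression, the function names, and the sample trace are illustrative assumptions, not DeepSeek's actual output format or tooling.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

# Hypothetical pattern for the "[[x1,y1,x2,y2]]" notation; the real format may differ.
BOX_PATTERN = re.compile(
    r"\[\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\]"
)


def extract_boxes(reasoning: str) -> List[Box]:
    """Return every bounding-box anchor found in a reasoning trace, in order."""
    return [
        tuple(int(value) for value in match.groups())
        for match in BOX_PATTERN.finditer(reasoning)
    ]


def is_well_formed(box: Box) -> bool:
    """A box is well formed if its top-left corner precedes its bottom-right corner."""
    x1, y1, x2, y2 = box
    return x1 < x2 and y1 < y2


# Made-up reasoning trace, for illustration only.
trace = (
    "I see a dog, and it is here: [[120,340,260,480]]. "
    "A second dog is here: [[400,310,530,470]]. "
    "So there are 2 dogs."
)
boxes = extract_boxes(trace)
print(boxes)
```

Each claim in the trace ("a dog", "a second dog") now has a coordinate that a human or a script can compare against the image, which is what makes the reasoning traceable.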

This technical direction presents an interesting contrast to OpenAI's approach.

OpenAI explicitly mentions the concept of "thinking with images" in the official documentation for o3 and o4-mini, meaning the model can incorporate images into its reasoning chain and process them through cropping, zooming, rotating, and other methods. The focus of this approach is to make images an integral part of the reasoning process, enabling the model to generate new images, modify existing ones, and perform operations on images during reasoning.

OpenAI's roadmap emphasizes general capabilities, with vision, code, search, files, and tool calling working together. The model features a powerful "visual workspace" that can flexibly handle a variety of visual tasks.

DeepSeek’s approach is more “symbolic.” It incorporates coordinates into the chain of thought, explicitly writing bounding boxes and point coordinates within the reasoning text, turning visual objects into reusable anchors during inference.

This means that OpenAI’s visual reasoning occurs internally, and users can only see the final answer and necessary explanations, with the intermediate visual processing steps being a black box. DeepSeek, on the other hand, deliberately makes the intermediate visual anchors explicit, rendering the entire reasoning process transparent.

By doing this, DeepSeek makes the reasoning process easier to train, inspect, and score. It also simplifies the design of format, quality, and task-level rewards. Especially in tasks like mazes and path tracing, it enables more granular feedback on path validity, trajectory coverage, and other metrics.
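To see why explicit coordinates make scoring easier, consider what a path-level check could look like. This is a minimal sketch under assumptions of my own: the maze is a small character grid where `#` is a wall, moves are limited to the four neighbouring cells, and "coverage" is the share of reference cells that the predicted trajectory visits. The report does not specify its reward functions, so none of these names or definitions should be read as DeepSeek's.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col)


def path_is_valid(grid: List[str], path: List[Cell], start: Cell, goal: Cell) -> bool:
    """Check that a path starts at start, ends at goal, avoids walls,
    and moves exactly one 4-neighbour step at a time."""
    if not path or path[0] != start or path[-1] != goal:
        return False
    rows, cols = len(grid), len(grid[0])
    for r, c in path:
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return False
    return True


def coverage(predicted: List[Cell], reference: List[Cell]) -> float:
    """Fraction of reference cells that the predicted trajectory visits."""
    if not reference:
        return 0.0
    visited = set(predicted)
    return sum(cell in visited for cell in reference) / len(reference)


# Toy maze: "." is open, "#" is a wall.
MAZE = [
    "..#",
    ".##",
    "...",
]
good = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
bad = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2)]  # steps onto the wall at (1, 1)

print(path_is_valid(MAZE, good, (0, 0), (2, 2)))
print(path_is_valid(MAZE, bad, (0, 0), (2, 2)))
print(coverage(bad, good))
```

A purely verbal answer such as "go down, then right" cannot be scored this way without first guessing which cells it refers to.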

The model has not only learned to output correct answers, but has also learned how to reason using visual primitives.

02 Efficiency Is the Core

DeepSeek’s report contains a subtle but crucial detail: their model uses far fewer tokens when processing images compared to other state-of-the-art models.

The report includes a comparison chart showing the number of tokens consumed by different models when processing an 800×800 resolution image.

In that chart, Gemini-3-Flash uses about 1,100 tokens, Claude-Sonnet-4.6 about 870, GPT-5.4 about 740, Qwen3-VL about 660, and DeepSeek about 361, of which only around 90 entries are retained in the KV cache.

This gap is not small. DeepSeek uses only one-third the number of tokens compared to Gemini, and about one-tenth the number of KV cache entries.

How is this extreme efficiency achieved?

DeepSeek uses a mechanism called Compressed Sparse Attention (CSA).

You can think of it this way: if you show a family photo to a friend, you wouldn’t say, “Starting from the 237th pixel from the left, there’s a red area…”—you’d simply say, “On the left is my mom, on the right is my dad.”

DeepSeek-ViT first compresses the image into fewer visual tokens, and CSA further compresses the representation of these visual tokens in the KV cache.

This mechanism was previously used in the DeepSeek-V4-Flash model and is now being applied to visual multimodal systems.

The compression process works as follows. A 756×756 image contains 571,536 pixels. The ViT first divides the image into 14×14 patches, producing 54×54 = 2,916 patch tokens. A 3×3 spatial compression then merges every 9 adjacent tokens into one along the channel dimension, leaving 324 visual tokens.

These 324 tokens are fed into the large language model for pre-filling. Finally, the CSA mechanism compresses these visual tokens in the KV cache by another 4x, retaining only 81 entries.

From 571,536 pixels to 81 KV cache entries, the overall compression ratio reaches 7,056 times.
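The arithmetic behind these figures is easy to reproduce. The sketch below only replays the numbers quoted above (14×14 patches, 3×3 merging, a further 4x reduction in the KV cache); the function name and its parameters are my own labels, and it says nothing about how the compression is actually implemented.

```python
def compression_pipeline(side: int, patch: int = 14, pool: int = 3, kv_factor: int = 4) -> dict:
    """Replay the token counts for a square image, stage by stage."""
    pixels = side * side
    patches_per_side = side // patch            # 756 // 14 = 54
    patch_tokens = patches_per_side ** 2        # 54 * 54 = 2,916
    visual_tokens = patch_tokens // (pool * pool)  # 3x3 merge: 2,916 / 9 = 324
    kv_entries = visual_tokens // kv_factor     # 4x KV compression: 324 / 4 = 81
    return {
        "pixels": pixels,
        "patch_tokens": patch_tokens,
        "visual_tokens": visual_tokens,
        "kv_entries": kv_entries,
        "ratio": pixels / kv_entries,
    }


stats = compression_pipeline(756)
print(stats)
```

The overall ratio is simply 196 pixels per patch × 9 patches per visual token × 4 visual tokens per KV entry, which is 7,056.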

Most major AI companies rely on brute-force methods to accumulate computing resources, while DeepSeek makes trade-offs at the level of information theory, retaining only the most intuitive and straightforward information.

The most direct result is that inference speed has increased significantly.

The number of image tokens directly affects the model's inference latency. During autoregressive generation, each time a new token is generated, the model must perform attention computations over the KV cache of all previous tokens. If an image occupies 1,000 tokens, attention must be computed over these 1,000 tokens for every generation step. If it only occupies 90 tokens, the computational load is significantly reduced.
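A rough count makes the difference concrete. This sketch assumes plain dense attention and counts, for a single layer and head, how many cached entries are read while generating a fixed number of tokens. It deliberately ignores everything else that affects real latency (layer count, batching, sparse attention over text tokens), so treat the ratio as an illustration rather than a benchmark. The function name is hypothetical.

```python
def kv_entries_attended(image_kv: int, prompt_tokens: int, new_tokens: int) -> int:
    """Total cached entries read while generating `new_tokens` tokens.

    Step t (0-based) attends over the image entries, the text prompt,
    and the t tokens generated so far.
    """
    return sum(image_kv + prompt_tokens + t for t in range(new_tokens))


# Same 100-token answer, with the image held as 1,000 vs. 90 cache entries.
heavy = kv_entries_attended(image_kv=1000, prompt_tokens=0, new_tokens=100)
light = kv_entries_attended(image_kv=90, prompt_tokens=0, new_tokens=100)
print(heavy, light, round(heavy / light, 1))
```

Under these assumptions the lighter representation reads roughly seven to eight times fewer cache entries for the same answer, and the gap widens as more images are added to the context.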

For applications requiring real-time responses, such as robotic vision, autonomous driving, and real-time video analysis, improved inference speed plays a decisive role.

And it also uses less memory.

The KV cache is a memory bottleneck in large model inference. Especially when handling long contexts or batched inference, it consumes significant GPU memory. DeepSeek compresses the KV cache for visual tokens to just 90 entries, enabling more images to be processed or longer multi-turn dialogues to be handled on the same hardware.

This is crucial for real-world deployment. Many companies’ multimodal models perform well in the lab but encounter cost issues when deployed practically. The more tokens each image consumes, the higher the inference cost and the fewer concurrent users can be supported. DeepSeek’s efficiency advantages are amplified during large-scale deployment.

It also indirectly increases the model's context capacity.

If an image requires 1,000 tokens, only about 128 images can fit within a 128k context window, and fewer once the accompanying text is counted. If it requires only 300 tokens, over 400 images can be accommodated. This is crucial for scenarios involving multi-image conversations, long-form video analysis, and extensive document understanding.
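The capacity estimate is a single integer division, sketched here with "128k" read as 128,000 tokens (an assumption; some models define it as 131,072). The `reserved_text_tokens` parameter is a hypothetical knob for the text that shares the window with the images.

```python
CONTEXT_WINDOW = 128_000  # assumption: "128k" taken as 128,000 tokens


def images_that_fit(
    tokens_per_image: int,
    reserved_text_tokens: int = 0,
    context_window: int = CONTEXT_WINDOW,
) -> int:
    """Number of whole images that fit once text tokens are set aside."""
    available = max(context_window - reserved_text_tokens, 0)
    return available // tokens_per_image


print(images_that_fit(1000))  # image-heavy representation
print(images_that_fit(300))   # compressed representation
print(images_that_fit(300, reserved_text_tokens=8000))
```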

DeepSeek's model can process more images within a single conversation, enabling comparison and analysis of dozens or even hundreds of images, and can track long-term changes in videos.

The most important thing is the training cost.

Although the report primarily focuses on inference efficiency, this compression mechanism is equally effective during training. Fewer visual tokens mean a smaller computational graph, faster training speeds, and lower hardware requirements.

DeepSeek has always been known for achieving better results with fewer resources. From R1’s reinforcement learning training, to V4’s MoE architecture, and now to visual multimodal capabilities, this efficiency-first philosophy has remained consistent throughout.

But here’s a key question: Will compression result in information loss?

DeepSeek does not deny that compression leads to information loss. Its claim is that, on this set of spatial reasoning and counting tasks, the compressed representations remain sufficiently effective.

Each compression step retains the information most critical for inference while discarding redundancy and noise.

In fact, the visual primitive mechanism of DeepSeek mentioned earlier is also a form of information compression. A bounding box can precisely locate an object using just four numbers, and a single point can mark a position using just two numbers. These discrete symbols carry much higher information density than raw pixels.

The experimental results show that this compression does not harm performance and even improves it on certain tasks.

This suggests that for many visual reasoning tasks, the bottleneck is not due to insufficient clarity in perception, but rather the lack of an appropriate representation method.

This efficiency advantage also demonstrates that multimodal intelligence does not necessarily require larger models, more computing power, or higher costs.

Since its inception, DeepSeek has always had an underlying principle: "True intelligence lies not in computing power, but in understanding the essence of a problem."

When you truly understand what visual reasoning requires, you won’t need so many tokens. When you find the right representation, you won’t need such a large model.

From this perspective, DeepSeek’s extreme efficiency is not the goal, but a byproduct. The real goal is to find the correct paradigm for visual reasoning. Efficiency merely proves that this paradigm is right.

03 Unfinished Business

In the limitations section of the report, DeepSeek candidly outlined several issues with the current approach. These are not minor technical flaws, but rather indicators pointing toward the next stage of visual reasoning.

The first issue is trigger word dependency.

The report explicitly states that the current "thinking in visual primitives" capability requires explicit trigger words to activate. The model has not yet learned to decide on its own when to draw boxes or place points and when language alone is sufficient.

Ideally, the model should decide autonomously based on the nature of the task. For example, when a user asks, “How many dogs are in the image?”, the model should automatically switch to visual primitive mode and use bounding boxes to assist with counting.

Technically, this requires building a metacognitive layer within the model. This metacognitive layer can assess the complexity of the current task, determine whether pure linguistic reasoning is sufficient, and decide whether to invoke visual primitives.

DeepSeek has not yet implemented this metacognitive layer, but they have clearly defined the direction. Future versions may enable the model to learn how to autonomously decide on reasoning strategies, rather than relying on external triggers.

The second issue is resolution limits.

The report notes that, due to input resolution limitations, the model's performance in fine-grained scenarios is still insufficient, and the generated visual primitives are sometimes not precise enough.

This issue is related to DeepSeek’s efficiency-first strategy. To control the number of tokens, they limit the visual token range to between 81 and 384. Images outside this range are resized accordingly.
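To get a feel for what this range implies, here is a minimal sketch that estimates the resize factor for a given image. It assumes each visual token covers a 42×42 pixel area (a 14-pixel patch followed by 3×3 merging, per the figures quoted earlier) and that images are scaled uniformly until the token count lands in the 81 to 384 range. The report's actual resizing rule is not described in this article, so the function and constants below are hypothetical.

```python
import math

PIXELS_PER_TOKEN_SIDE = 14 * 3  # assumption: 14-px patches, then 3x3 merging
MIN_TOKENS, MAX_TOKENS = 81, 384


def approx_tokens(width: int, height: int) -> float:
    """Approximate visual token count before any resizing."""
    return (width / PIXELS_PER_TOKEN_SIDE) * (height / PIXELS_PER_TOKEN_SIDE)


def resize_scale(width: int, height: int) -> float:
    """Uniform scale factor that brings the token count into the allowed range."""
    tokens = approx_tokens(width, height)
    if tokens > MAX_TOKENS:
        return math.sqrt(MAX_TOKENS / tokens)  # shrink large images
    if tokens < MIN_TOKENS:
        return math.sqrt(MIN_TOKENS / tokens)  # enlarge small images
    return 1.0


print(resize_scale(756, 756))    # 324 tokens: already in range
print(resize_scale(1512, 1512))  # 1,296 tokens: must shrink
print(resize_scale(189, 189))    # about 20 tokens: must grow
```

Under these assumptions a 1512×1512 image is shrunk to a little over half its side length, which is exactly the kind of downscaling that costs fine detail.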

This design is reasonable for most scenarios, but it encounters limitations in tasks requiring extremely high precision. For example, medical imaging analysis needs to detect tiny lesions, and industrial quality inspection requires identifying minute defects—these scenarios demand high resolution.

DeepSeek mentions in the report that this issue can be addressed by integrating existing high-resolution methods. In other words, their visual primitive framework and traditional high-resolution cropping methods are not mutually exclusive but complementary.

I think DeepSeek could offer a hybrid solution.

For most routine tasks, use compressed visual representations and visual primitive reasoning to maintain high efficiency. For localized regions requiring fine-grained analysis, dynamically invoke high-resolution crops to extract more detailed visual information. This approach preserves overall efficiency while meeting local precision requirements.

The key to this hybrid approach is teaching the model to determine which areas require high-resolution processing, which brings us back to the earlier issue of metacognition.

The third issue is cross-scenario generalization.

The report notes that using dots as visual primitives to solve complex topological reasoning problems remains challenging, and the model's ability to generalize across scenarios is limited.

This issue is particularly evident in maze navigation and path tracking tasks. Although DeepSeek achieved accuracy rates of 66.9% and 56.7% on its self-constructed test set, surpassing other models, these figures still leave considerable room for improvement.

More importantly, these tasks were trained and tested on synthetic data. Mazes were algorithmically generated, and the paths to trace were procedurally drawn. When the model encounters real-world topological reasoning problems—such as planning routes on actual maps or tracing connections in complex pipeline diagrams—its performance may decline.

DeepSeek's approach enhances generalization through large-scale, highly diverse data. They crawled 97,984 data sources, rigorously filtered them down to 31,701, and ultimately obtained over 40 million samples. For maze and path-tracking tasks, they designed a variety of topologies, visual styles, and difficulty levels to capture as many variations as possible.

However, data diversity is only one part of generalization ability. Does the model truly understand the essence of topological reasoning, or is it merely memorizing patterns from the training data?

Additionally, DeepSeek’s visual primitives constitute a new representation system requiring specialized data formats, training procedures, and evaluation methods, which are not fully compatible with the existing multimodal ecosystem.

Most multimodal datasets and evaluation benchmarks are designed based on the traditional "image + text" paradigm and do not consider visual primitives. To evaluate DeepSeek's models on these benchmarks, either the visual primitives functionality must be disabled, or the evaluation methods must be redesigned.

Other researchers who wish to reproduce or improve this work need to rebuild the entire data and training pipeline, which presents a high barrier to entry.

The fact that DeepSeek openly acknowledges these issues in its report shows that the team has a clear understanding of its own work.

This may be more valuable than providing a perfect answer, because what often drives societal progress is not the answer, but the question.