DeepSeek-V3.2-Exp
Experimental sparse attention model for efficient long-context processing
About the model
DeepSeek-V3.2-Exp is an experimental model that introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to dramatically improve training and inference efficiency in long-context scenarios. Built on V3.1-Terminus, this model achieves substantial computational efficiency gains while maintaining virtually identical output quality and performance across diverse benchmarks including reasoning, coding, mathematics, and agentic tasks.
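The efficiency claim can be sanity-checked with back-of-envelope arithmetic using the numbers stated in this card (a 128K context window and a selection budget of 2048 key-value tokens per query); this sketch ignores the lightweight indexer's own cost and all constant factors:

```python
# Back-of-envelope attention-cost comparison using the model card's figures.
# Assumption: indexer overhead and constant factors are ignored.
L = 128 * 1024   # 128K-token context window
k = 2048         # key-value tokens selected per query under DSA

dense_scores = L * L    # dense attention: O(L^2) query-key scores
sparse_scores = L * k   # DSA: O(L*k) scores after top-k selection
ratio = dense_scores // sparse_scores
print(ratio)  # 64 -> each query scores 64x fewer keys at full context
```

The gap widens with context length, which is why the cost reduction is most pronounced at the 128K limit.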
Model card
Architecture Overview:
• 685B total parameters with Mixture-of-Experts (MoE) architecture
• Multi-head Latent Attention (MLA) operating in MQA mode for efficient key-value sharing
• 128K token context window with extended long-context capabilities
• DeepSeek Sparse Attention (DSA) featuring a lightning indexer and fine-grained token selection
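The DSA pipeline described above can be sketched in a few lines: a cheap "lightning indexer" scores every cached token for the current query, the top-k tokens are selected, and standard attention runs only over that subset. Everything here (the dot-product stand-in for the indexer, shapes, function names) is illustrative, not the real implementation:

```python
import numpy as np

def dsa_sketch(q, keys, values, k=2048):
    """Illustrative sketch of DSA: lightning indexer + fine-grained
    top-k token selection, cutting per-query cost from O(L) to O(k)."""
    L, d = keys.shape
    # 1. Lightning indexer: cheap relevance score per cached token
    #    (a plain dot product here stands in for the real indexer).
    index_scores = keys @ q                       # (L,)
    # 2. Fine-grained token selection: keep the k highest-scoring tokens.
    k = min(k, L)
    top = np.argpartition(index_scores, -k)[-k:]  # indices of top-k tokens
    # 3. Standard softmax attention, restricted to the selected tokens.
    logits = (keys[top] @ q) / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[top]                        # (d,)

rng = np.random.default_rng(0)
L, d = 4096, 64
out = dsa_sketch(rng.normal(size=d), rng.normal(size=(L, d)),
                 rng.normal(size=(L, d)), k=256)
print(out.shape)  # (64,)
```

In the real model the indexer is a small learned module and the selection feeds the MLA kernel, but the control flow is the same: score all, attend to few.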
Training Methodology:
• Continued pre-training from DeepSeek-V3.1-Terminus base checkpoint
• Two-stage training: dense warm-up (2.1B tokens) followed by sparse training (943.7B tokens)
• Lightning indexer trained with KL-divergence alignment to main attention distribution
• Post-training includes specialist distillation across mathematics, coding, reasoning, and agentic domains
• Group Relative Policy Optimization (GRPO) for reinforcement learning alignment
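The KL-divergence alignment step above can be sketched as follows: the indexer's score distribution is trained to match the main attention distribution over the same tokens. This is a minimal sketch of that objective only; the actual training setup (distributions, masking, weighting) is not fully public:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def indexer_kl_loss(indexer_scores, attn_logits):
    """KL(p_attn || p_indexer): pushes the lightning indexer's
    distribution toward the main attention distribution (sketch)."""
    p = softmax(attn_logits)     # target: main attention distribution
    q = softmax(indexer_scores)  # indexer's predicted distribution
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

attn = np.array([0.1, 2.0, -1.0, 0.5])
loss_matched = indexer_kl_loss(attn, attn)          # identical -> 0.0
loss_off = indexer_kl_loss(np.zeros(4), attn)       # mismatch -> > 0
print(loss_matched, loss_off > 0)
```

A loss of zero when the two distributions coincide is what makes this a clean alignment target: the indexer can rank tokens well without having to reproduce attention outputs.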
Performance Benchmarks:
DeepSeek-V3.2-Exp demonstrates performance on par with V3.1-Terminus across comprehensive evaluations (run in reasoning mode):

| Category | Benchmark | DeepSeek-V3.1-Terminus | DeepSeek-V3.2-Exp |
|---|---|---|---|
| General | MMLU-Pro | 85.0 | 85.0 |
| General | GPQA-Diamond | 80.7 | 79.9 |
| General | Humanity's Last Exam | 21.7 | 19.8 |
| Code | LiveCodeBench | 74.9 | 74.1 |
| Code | Codeforces-Div1 (rating) | 2046 | 2121 |
| Code | Aider-Polyglot | 76.1 | 74.5 |
| Math | AIME 2025 | 88.4 | 89.3 |
| Math | HMMT 2025 | 86.1 | 83.6 |
| Agentic Tool Use | BrowseComp | 38.5 | 40.1 |
| Agentic Tool Use | BrowseComp-zh | 45.0 | 47.9 |
| Agentic Tool Use | SimpleQA | 96.8 | 97.1 |
| Agentic Tool Use | SWE Verified | 68.4 | 67.8 |
| Agentic Tool Use | SWE-bench Multilingual | 57.8 | 57.9 |
| Agentic Tool Use | Terminal-bench | 36.7 | 37.7 |

Efficiency Characteristics:
• Reduces core attention complexity from O(L²) to O(Lk) where k≪L
• Up to 70% cost reduction for long-context inference at 128K tokens
• Selects 2048 key-value tokens per query token during sparse attention
• Optimized for H800, H200, MI350, and NPU deployments with specialized kernels
Applications & use cases
Long-Context Processing:
• Extended document analysis and summarization up to 128K tokens
• Multi-document question answering and information synthesis
• Legal document review and contract analysis
• Research paper analysis and literature review automation
Code & Development:
• Software engineering tasks with large codebase context (SWE-bench: 67.8%)
• Multi-file code generation and refactoring (Aider-Polyglot: 74.5%)
• Competitive programming with advanced algorithms (Codeforces: 2121 rating)
• Terminal and command-line task automation
Reasoning & Mathematics:
• Advanced mathematical problem solving (AIME 2025: 89.3%, HMMT 2025: 83.6%)
• Multi-step logical reasoning and proof generation
• Scientific research assistance and hypothesis generation
• STEM education and tutoring applications
Agentic Applications:
• Web search and browsing agents (BrowseComp: 40.1%)
• Automated information gathering and fact-checking (SimpleQA: 97.1%)
• Task automation and workflow orchestration
• Multi-step planning and execution with tool use