DeepSeek-V3.2-Exp
Experimental sparse attention model for efficient long-context processing
About the model
DeepSeek-V3.2-Exp is an experimental model that introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to dramatically improve training and inference efficiency in long-context scenarios. Built on V3.1-Terminus, this model achieves substantial computational efficiency gains while maintaining virtually identical output quality and performance across diverse benchmarks including reasoning, coding, mathematics, and agentic tasks.
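The efficiency claim can be sanity-checked with back-of-envelope arithmetic using the numbers stated in this card (a 128K context window and a selection budget of 2048 key-value tokens per query); this sketch ignores the lightweight indexer's own cost and all constant factors:

```python
# Back-of-envelope attention-cost comparison using the model card's figures.
# Assumption: indexer overhead and constant factors are ignored.
L = 128 * 1024   # 128K-token context window
k = 2048         # key-value tokens selected per query under DSA

dense_scores = L * L    # dense attention: O(L^2) query-key scores
sparse_scores = L * k   # DSA: O(L*k) scores after top-k selection
ratio = dense_scores // sparse_scores
print(ratio)  # 64 -> each query scores 64x fewer keys at full context
```

The gap widens with context length, which is why the cost reduction is most pronounced at the 128K limit.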
Model card
Architecture Overview:
• 685B total parameters with Mixture-of-Experts (MoE) architecture
• Multi-head Latent Attention (MLA) operating in MQA mode for efficient key-value sharing
• 128K token context window with extended long-context capabilities
• DeepSeek Sparse Attention (DSA) featuring a lightning indexer and fine-grained token selection
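The DSA pipeline described above can be sketched in a few lines: a cheap "lightning indexer" scores every cached token for the current query, the top-k tokens are selected, and standard attention runs only over that subset. Everything here (the dot-product stand-in for the indexer, shapes, function names) is illustrative, not the real implementation:

```python
import numpy as np

def dsa_sketch(q, keys, values, k=2048):
    """Illustrative sketch of DSA: lightning indexer + fine-grained
    top-k token selection, cutting per-query cost from O(L) to O(k)."""
    L, d = keys.shape
    # 1. Lightning indexer: cheap relevance score per cached token
    #    (a plain dot product here stands in for the real indexer).
    index_scores = keys @ q                       # (L,)
    # 2. Fine-grained token selection: keep the k highest-scoring tokens.
    k = min(k, L)
    top = np.argpartition(index_scores, -k)[-k:]  # indices of top-k tokens
    # 3. Standard softmax attention, restricted to the selected tokens.
    logits = (keys[top] @ q) / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[top]                        # (d,)

rng = np.random.default_rng(0)
L, d = 4096, 64
out = dsa_sketch(rng.normal(size=d), rng.normal(size=(L, d)),
                 rng.normal(size=(L, d)), k=256)
print(out.shape)  # (64,)
```

In the real model the indexer is a small learned module and the selection feeds the MLA kernel, but the control flow is the same: score all, attend to few.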
Training Methodology:
• Continued pre-training from DeepSeek-V3.1-Terminus base checkpoint
• Two-stage training: dense warm-up (2.1B tokens) followed by sparse training (943.7B tokens)
• Lightning indexer trained with KL-divergence alignment to main attention distribution
• Post-training includes specialist distillation across mathematics, coding, reasoning, and agentic domains
• Group Relative Policy Optimization (GRPO) for reinforcement learning alignment
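The KL-divergence alignment step above can be sketched as follows: the indexer's score distribution is trained to match the main attention distribution over the same tokens. This is a minimal sketch of that objective only; the actual training setup (distributions, masking, weighting) is not fully public:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def indexer_kl_loss(indexer_scores, attn_logits):
    """KL(p_attn || p_indexer): pushes the lightning indexer's
    distribution toward the main attention distribution (sketch)."""
    p = softmax(attn_logits)     # target: main attention distribution
    q = softmax(indexer_scores)  # indexer's predicted distribution
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

attn = np.array([0.1, 2.0, -1.0, 0.5])
loss_matched = indexer_kl_loss(attn, attn)          # identical -> 0.0
loss_off = indexer_kl_loss(np.zeros(4), attn)       # mismatch -> > 0
print(loss_matched, loss_off > 0)
```

A loss of zero when the two distributions coincide is what makes this a clean alignment target: the indexer can rank tokens well without having to reproduce attention outputs.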
Performance Benchmarks:
DeepSeek-V3.2-Exp demonstrates performance on par with V3.1-Terminus across comprehensive evaluations (run in reasoning mode):

| Category | Benchmark | DeepSeek-V3.1-Terminus | DeepSeek-V3.2-Exp |
|---|---|---|---|
| General | MMLU-Pro | 85.0 | 85.0 |
| General | GPQA-Diamond | 80.7 | 79.9 |
| General | Humanity's Last Exam | 21.7 | 19.8 |
| Code | LiveCodeBench | 74.9 | 74.1 |
| Code | Codeforces-Div1 (rating) | 2046 | 2121 |
| Code | Aider-Polyglot | 76.1 | 74.5 |
| Math | AIME 2025 | 88.4 | 89.3 |
| Math | HMMT 2025 | 86.1 | 83.6 |
| Agentic Tool Use | BrowseComp | 38.5 | 40.1 |
| Agentic Tool Use | BrowseComp-zh | 45.0 | 47.9 |
| Agentic Tool Use | SimpleQA | 96.8 | 97.1 |
| Agentic Tool Use | SWE Verified | 68.4 | 67.8 |
| Agentic Tool Use | SWE-bench Multilingual | 57.8 | 57.9 |
| Agentic Tool Use | Terminal-bench | 36.7 | 37.7 |

Efficiency Characteristics:
• Reduces core attention complexity from O(L²) to O(Lk) where k≪L
• Up to 70% cost reduction for long-context inference at 128K tokens
• Selects 2048 key-value tokens per query token during sparse attention
• Optimized for H800, H200, MI350, and NPU deployments with specialized kernels
Applications & use cases
Long-Context Processing:
• Extended document analysis and summarization up to 128K tokens
• Multi-document question answering and information synthesis
• Legal document review and contract analysis
• Research paper analysis and literature review automation
Code & Development:
• Software engineering tasks with large codebase context (SWE-bench: 67.8%)
• Multi-file code generation and refactoring (Aider-Polyglot: 74.5%)
• Competitive programming with advanced algorithms (Codeforces: 2121 rating)
• Terminal and command-line task automation
Reasoning & Mathematics:
• Advanced mathematical problem solving (AIME 2025: 89.3%, HMMT 2025: 83.6%)
• Multi-step logical reasoning and proof generation
• Scientific research assistance and hypothesis generation
• STEM education and tutoring applications
Agentic Applications:
• Web search and browsing agents (BrowseComp: 40.1%)
• Automated information gathering and fact-checking (SimpleQA: 97.1%)
• Task automation and workflow orchestration
• Multi-step planning and execution with tool use