
DeepSeek-V3.2-Exp

Experimental sparse attention model for efficient long-context processing

About the model

DeepSeek-V3.2-Exp is an experimental model that introduces DeepSeek Sparse Attention (DSA), a fine-grained sparse attention mechanism designed to dramatically improve training and inference efficiency in long-context scenarios. Built on V3.1-Terminus, this model achieves substantial computational efficiency gains while maintaining virtually identical output quality and performance across diverse benchmarks including reasoning, coding, mathematics, and agentic tasks.

  • Model card

    Architecture Overview:
    • 685B total parameters with Mixture-of-Experts (MoE) architecture
    • Multi-head Latent Attention (MLA) with MQA mode for efficient key-value sharing
    • 128K token context window with extended long-context capabilities
    • DeepSeek Sparse Attention (DSA) featuring a lightning indexer and fine-grained token selection
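
    The indexer-plus-selection idea behind DSA can be sketched in a few lines. This is a hedged toy illustration, not DeepSeek's implementation: the low-dimensional `idx_q`/`idx_K` projections standing in for the lightning indexer are hypothetical names, and a single query is shown for clarity.

```python
import numpy as np

def sparse_attention(q, K, V, idx_q, idx_K, k=4):
    """Toy DSA-style sparse attention for one query token.

    A cheap indexer pass (low-dim projections idx_q, idx_K; names
    hypothetical) scores all L keys, only the top-k survive, and full
    attention runs over that subset -- O(L*k) total work across queries
    instead of O(L^2).
    """
    # Indexer pass: inexpensive relevance score for every key.
    scores = idx_K @ idx_q                       # shape (L,)
    top = np.argsort(scores)[-k:]                # indices of the k best keys
    # Full attention restricted to the selected tokens.
    logits = K[top] @ q / np.sqrt(q.shape[0])    # shape (k,)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[top]                            # shape (d,)

rng = np.random.default_rng(0)
L, d, d_idx = 16, 8, 2
q = rng.normal(size=(d,))
K, V = rng.normal(size=(L, d)), rng.normal(size=(L, d))
out = sparse_attention(q, K, V, rng.normal(size=(d_idx,)), rng.normal(size=(L, d_idx)))
print(out.shape)  # (8,)
```

    The key design point: the indexer still touches every key, but in a much smaller dimension, so the expensive full-width attention only pays for k tokens per query.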

    Training Methodology:
    • Continued pre-training from DeepSeek-V3.1-Terminus base checkpoint
    • Two-stage training: dense warm-up (2.1B tokens) followed by sparse training (943.7B tokens)
    • Lightning indexer trained with KL-divergence alignment to main attention distribution
    • Post-training includes specialist distillation across mathematics, coding, reasoning, and agentic domains
    • Group Relative Policy Optimization (GRPO) for reinforcement learning alignment
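
    The KL-divergence alignment step above trains the cheap indexer to imitate the main attention distribution. A minimal sketch of such a loss, with hypothetical names and not DeepSeek's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def indexer_kl_loss(indexer_logits, attn_logits):
    """KL(p_attn || p_indexer): pushes the lightweight indexer's score
    distribution toward the dense attention weights (illustrative sketch)."""
    p = softmax(attn_logits)     # target: main attention distribution
    q = softmax(indexer_logits)  # indexer's predicted relevance
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(1)
attn = rng.normal(size=16)
loss_random = indexer_kl_loss(rng.normal(size=16), attn)   # untrained indexer
loss_aligned = indexer_kl_loss(attn.copy(), attn)          # perfect match -> 0
print(loss_aligned, loss_random > loss_aligned)
```

    Minimizing this term during the warm-up and sparse-training stages is what lets the top-k selection approximate which tokens dense attention would have weighted most heavily.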

    Performance Benchmarks:
    DeepSeek-V3.2-Exp demonstrates performance on par with V3.1-Terminus across comprehensive evaluations:

    Benchmark                 DeepSeek-V3.1-Terminus  DeepSeek-V3.2-Exp
    Reasoning Mode (General)
      MMLU-Pro                85.0                    85.0
      GPQA-Diamond            80.7                    79.9
      Humanity’s Last Exam    21.7                    19.8
    Code
      LiveCodeBench           74.9                    74.1
      Codeforces-Div1         2046                    2121
      Aider-Polyglot          76.1                    74.5
    Math
      AIME 2025               88.4                    89.3
      HMMT 2025               86.1                    83.6
    Agentic Tool Use
      BrowseComp              38.5                    40.1
      BrowseComp-zh           45.0                    47.9
      SimpleQA                96.8                    97.1
      SWE Verified            68.4                    67.8
      SWE-bench Multilingual  57.8                    57.9
      Terminal-bench          36.7                    37.7

    Efficiency Characteristics:
    • Reduces core attention complexity from O(L²) to O(Lk) where k≪L
    • Up to 70% cost reduction for long-context inference at 128K tokens
    • Selects 2048 key-value tokens per query token during sparse attention
    • Optimized for H800, H200, MI350, and NPU deployments with specialized kernels
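
    A quick back-of-envelope check of the complexity claim: at L = 128K with k = 2048 selected tokens per query, the number of full-width attention score computations drops by a factor of L/k = 64. The end-to-end saving is smaller (the stated up-to-70% cost figure) because the indexer pass and non-attention compute still scale with L.

```python
# Back-of-envelope: dense vs. DSA-style sparse attention work.
L = 128 * 1024   # 128K-token context
k = 2048         # key-value tokens selected per query

dense = L * L    # O(L^2) score computations
sparse = L * k   # O(L*k) after top-k selection (indexer overhead ignored)

print(dense // sparse)  # 64: each query attends to 2048 keys, not 131072
```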

  • Applications & use cases

    Long-Context Processing:
    • Extended document analysis and summarization up to 128K tokens
    • Multi-document question answering and information synthesis
    • Legal document review and contract analysis
    • Research paper analysis and literature review automation

    Code & Development:
    • Software engineering tasks with large codebase context (SWE-bench: 67.8%)
    • Multi-file code generation and refactoring (Aider-Polyglot: 74.5%)
    • Competitive programming with advanced algorithms (Codeforces: 2121 rating)
    • Terminal and command-line task automation

    Reasoning & Mathematics:
    • Advanced mathematical problem solving (AIME 2025: 89.3%, HMMT 2025: 83.6%)
    • Multi-step logical reasoning and proof generation
    • Scientific research assistance and hypothesis generation
    • STEM education and tutoring applications

    Agentic Applications:
    • Web search and browsing agents (BrowseComp: 40.1%)
    • Automated information gathering and fact-checking (SimpleQA: 97.1%)
    • Task automation and workflow orchestration
    • Multi-step planning and execution with tool use

Model details
  • Model provider
    DeepSeek
  • Type
    Code
    LLM
    Reasoning
  • Main use cases
    Chat
  • Deployment
    Monthly Reserved
  • Parameters
    685B
  • Context length
    128K
  • Input modalities
    Text
  • Output modalities
    Text
  • Category
    Code