Kimi K2.5
State-of-the-art multimodal thinking agent with vision and Agent Swarm
About model
Kimi K2.5 is Moonshot AI's most capable open-source thinking model, built as a thinking agent that reasons step by step while dynamically invoking tools. It sets new state-of-the-art records on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks, and dramatically scales multi-step reasoning depth while maintaining stable tool use across 200–300 sequential calls, a breakthrough in long-horizon agency. Native INT4 quantization delivers 2x faster inference.
- 50.2% HLE (with tools): Expert-level multimodal reasoning across 100+ subjects
- 15T tokens: Native multimodal pretraining at scale
- 2x inference speed: Native INT4 quantization with QAT
- Native Multimodality: Pre-trained on vision-language tokens, excels in visual knowledge, cross-modal reasoning, and agentic tool use grounded in visual inputs
- Coding with Vision: Generates code from visual specifications (UI designs, video workflows) and autonomously chains tools for visual data processing
- Agent Swarm: Transitions from single-agent scaling to self-directed, coordinated swarm-like execution—decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents
- Production-Ready Efficiency: Native INT4 quantization delivers lossless 2x speed improvements, paired with a 256K context window
Benchmark comparison: Kimi K2.5 versus related open-source and competitor closed-source models on AIME 2025, GPQA Diamond, HLE, LiveCodeBench, MATH500, and SWE-bench Verified. Kimi K2.5's own scores are listed under Performance Characteristics below.
API usage
Endpoint:
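The serving endpoint and model identifier are not listed on this page, so the snippet below is a minimal sketch assuming an OpenAI-compatible chat completions API; the `base_url`, `api_key`, and `model` values are placeholders to replace with the provider's actual details.

```python
# Hypothetical sketch: the exact endpoint and model identifier are not shown on this
# page. An OpenAI-compatible chat completions API is assumed; base_url and model
# name below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example.com/v1",  # replace with the provider's actual endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model id; check the provider's model catalog
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of MoE inference in three bullets."},
    ],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```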
Model card
Architecture Overview:
• Mixture-of-Experts (MoE) architecture with 1T total parameters and 32B activated parameters
• 61 total layers (including 1 dense layer), with 384 experts of which 8 are selected per token (see the routing sketch after this list)
• Multi-head Latent Attention (MLA) with an attention hidden dimension of 7168
• Native vision encoder: MoonViT with 400M parameters for vision-language integration
• Native INT4 quantization applied to MoE components through Quantization-Aware Training (QAT)
• 256K context window enabling complex long-horizon multimodal agentic tasks
• 160K vocabulary size with SwiGLU activation function
• Unified architecture combining vision and text, instant and thinking modes, conversational and agentic paradigms
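To make the expert-routing figures above concrete, here is a toy top-k routing sketch. It is illustrative only (toy dimensions, NumPy instead of the real training stack) and not Moonshot's implementation, but it shows why selecting 8 of 384 experts keeps only a small fraction of the 1T total parameters active per token.

```python
# Minimal top-k MoE routing sketch (illustrative only, not Moonshot's actual code).
# A router scores all 384 experts, keeps the top 8, and the token's output is a
# weighted sum of just those experts' outputs, so most parameters stay inactive.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

num_experts, top_k, hidden = 384, 8, 16   # hidden=16 is a toy size, not the real 7168

rng = np.random.default_rng(0)
router_w = rng.normal(size=(hidden, num_experts))           # router projection
experts = rng.normal(size=(num_experts, hidden, hidden))    # one toy FFN matrix per expert

token = rng.normal(size=hidden)                             # one token's hidden state
scores = token @ router_w                                   # router logits over 384 experts
top = np.argsort(scores)[-top_k:]                           # indices of the 8 best experts
weights = softmax(scores[top])                              # normalize over selected experts

# Only the 8 selected experts run; the other 376 are skipped entirely.
output = sum(w * (experts[i] @ token) for w, i in zip(weights, top))
print("selected experts:", sorted(top.tolist()))
```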
Training Methodology:
• Continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base
• Native multimodal training—pre-trained on vision-language tokens for seamless cross-modal reasoning
• End-to-end trained to interleave chain-of-thought reasoning with function calls and visual grounding
• Quantization-Aware Training (QAT) employed for lossless INT4 inference at 2x speed (see the fake-quantization sketch after this list)
• Agent Swarm training—transitions from single-agent scaling to self-directed, coordinated swarm-like execution
• Specialized training for parallel task decomposition and domain-specific agent instantiation
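As a rough illustration of the QAT step above, the sketch below fake-quantizes a weight tensor to INT4 during the forward pass, so training sees the same rounding error that INT4 inference will introduce. The symmetric per-tensor scheme is an assumption for illustration, not Moonshot's published recipe.

```python
# Conceptual fake-quantization sketch for INT4 QAT (illustrative, not Moonshot's recipe).
# Weights are rounded to a 4-bit grid and immediately dequantized, so the model learns
# parameters that remain accurate under INT4 rounding at inference time.
import numpy as np

def fake_quant_int4(w):
    # Symmetric signed INT4 range is [-8, 7]; per-tensor scale is an assumption.
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7)   # quantize: round to the 4-bit grid
    return q * scale                           # dequantize back to float for the forward pass

w = np.random.default_rng(0).normal(scale=0.02, size=(4, 4))
w_q = fake_quant_int4(w)
print("max abs quantization error:", np.abs(w - w_q).max())
```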
Key Capabilities:
• Native Multimodality: Excels in visual knowledge, cross-modal reasoning, and agentic tool use grounded in visual inputs
• Coding with Vision: Generates code from visual specifications (UI designs, video workflows) and autonomously chains tools for visual data processing
• Agent Swarm: Decomposes complex tasks into parallel sub-tasks executed by dynamically instantiated, domain-specific agents
• Vision benchmarks: 78.5% MMMU-Pro, 84.2% MathVision, 90.1% MathVista, 77.5% CharXiv reasoning
Performance Characteristics:
• State-of-the-art 50.2% on Humanity's Last Exam (HLE) with tools across 100+ expert subjects
• Advanced mathematical and scientific reasoning: 96.1% AIME 2025, 95.4% HMMT 2025, 81.8% IMO-AnswerBench, 87.4% GPQA-Diamond
• Strong coding capabilities: 76.8% SWE-Bench Verified, 73.0% SWE-Bench Multilingual, 85.0% LiveCodeBench v6
• Agentic search with swarm: 78.4% BrowseComp (swarm mode), 57.5% Seal-0
• Long-context excellence: 79.3% on AA-LCR (avg@3), 69.4% LongBench-v2 (128K context)
• 2x generation speed improvement through native INT4 quantization without performance degradation
Applications & use cases
Multimodal Agentic Reasoning:
• Expert-level reasoning across 100+ subjects achieving 50.2% on Humanity's Last Exam with tools
• Vision-grounded reasoning: 78.5% MMMU-Pro, 84.2% MathVision, 90.1% MathVista
• Cross-modal problem solving combining visual understanding with mathematical and logical reasoning
• Competition-level mathematical problem solving: 96.1% AIME 2025, 95.4% HMMT 2025
• Dynamic hypothesis generation from visual and textual inputs with evidence verification
Coding with Vision:
• Generate code from visual specifications: UI designs, mockups, and video workflows (see the sketch after this list)
• Autonomous tool chaining for visual data processing and analysis
• Production-level coding: 76.8% SWE-Bench Verified, 73.0% SWE-Bench Multilingual
• Frontend development from visual designs: fully functional HTML, React, and responsive web applications
• Video-to-code generation: analyze video workflows and generate implementation code
• Competitive programming: 85.0% LiveCodeBench v6, 53.6% OJ-Bench
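A hedged sketch of the mockup-to-code workflow above: it assumes the same OpenAI-compatible API as the earlier API usage example, with placeholder endpoint and model id, and the standard image_url content format for vision inputs.

```python
# Hypothetical coding-with-vision sketch: send a UI mockup image and ask for frontend
# code. The endpoint and model id are placeholders; an OpenAI-compatible image-content
# message format is assumed.
import base64
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_API_KEY")

with open("mockup.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Implement this mockup as a single responsive HTML page."},
        ],
    }],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```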
Agent Swarm Orchestration:
• Self-directed task decomposition into parallel sub-tasks
• Dynamically instantiate domain-specific agents for coordinated execution
• Swarm mode performance: 62.3% BrowseComp, 19.4% WideSearch
• Complex research workflows with parallel information gathering and synthesis
• Multi-agent coding projects with specialized sub-agents for different components
Visual Understanding & Analysis:
• Native image and video understanding with 400M parameter MoonViT encoder
• Chart and graph reasoning: 77.5% CharXiv reasoning questions
• Document understanding and visual question answering
• Scientific visualization analysis and interpretation
• UI/UX design understanding for code generation
Agentic Search & Web Reasoning:
• Goal-directed web-based reasoning with visual content understanding
• Continuous browsing, searching, and reasoning over multimodal web information
• 62.3% BrowseComp in swarm mode with coordinated sub-agent exploration
• Visual content extraction and analysis from web sources
Long-Horizon Multimodal Workflows:
• Research automation across text and visual sources
• Video analysis workflows with tool-augmented reasoning
• Complex design-to-implementation pipelines
• Multi-step visual data processing and code generation
• 79.3% AA-LCR (avg@3), 69.4% LongBench-v2 with 128K context
Creative & Multimodal Content Generation:
• Image-grounded creative writing and storytelling
• Visual analysis and cultural commentary
• Technical documentation from visual specifications
• Educational content combining visual and textual explanations
- Model provider: Moonshot AI
- Type: Chat, Code, LLM
- Main use cases: Vision
- Speed: Medium
- Intelligence: Very High
- Deployment: Serverless, On-Demand Dedicated, Monthly Reserved
- Endpoint:
- Parameters: 1T
- Activated parameters: 32B
- Context length: 262K
- Input price: $0.50 / 1M tokens
- Output price: $2.80 / 1M tokens
- Input modalities: Text, Image
- Output modalities: Text
- Released: December 31, 2025
- Last updated: January 26, 2026
- Quantization level: INT4
- External link:
- Category: Chat