NVIDIA Nemotron 3 Ultra

Open frontier reasoning model for long-running autonomous agents and complex workflows

About model

NVIDIA Nemotron 3 Ultra is an open reasoning model with 550B total parameters (55B activated per token), built for long-running autonomous agents that handle orchestration and complex tasks across coding, deep research, and enterprise workflows. Its hybrid Mamba-Transformer MoE architecture combines three features: Latent MoE, which runs 4 experts at the inference cost of 1; Multi-Token Prediction, which reduces generation time on long sequences; and Token Budget support, which targets high accuracy with minimal reasoning token output. The model supports a 1M token context window and is fully open under the NVIDIA Open Model License, with open weights, training data, and recipes.

  • Total Parameters (55B Activated): 550B
    Hybrid Mamba-Transformer MoE with Latent MoE
  • Context Window: 1M
    Sustained reasoning across long-running agent sessions
  • Weights + Data + Recipes: Open
    NVIDIA Open Model License for enterprise customization

Model key capabilities
  • Coding Agents: Architectural planning, complex multi-file refactors, and error recovery across large codebases — handling the hardest reasoning calls within end-to-end coding agent workflows
  • Deep Research: Sustained synthesis across large source sets, resolving contradictions and proposing novel hypotheses within research agent loops
  • Enterprise & EDA Workflows: Complex reasoning steps within persistent, tool-using agent loops across security, regulatory, clinical, and chip design domains including RTL generation and design verification
  • Efficient Architecture: Latent MoE runs 4 experts at the inference cost of 1, Multi-Token Prediction reduces generation time for long sequences, and Token Budget limits reasoning token usage, all within a 1M token context window
  • API usage

    Endpoint:

    nvidia/nemotron-3-ultra-550b-a55b

    cURL:

    curl -X POST "https://api.together.xyz/v1/chat/completions" \
      -H "Authorization: Bearer $TOGETHER_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "nvidia/nemotron-3-ultra-550b-a55b",
        "messages": [
          {
            "role": "user",
            "content": "What are some fun things to do in New York?"
          }
        ]
      }'

    Python:

    from together import Together

    # The client reads the API key from the TOGETHER_API_KEY environment variable
    client = Together()

    response = client.chat.completions.create(
      model="nvidia/nemotron-3-ultra-550b-a55b",
      messages=[
        {
          "role": "user",
          "content": "What are some fun things to do in New York?"
        }
      ]
    )
    print(response.choices[0].message.content)

    TypeScript:

    import Together from 'together-ai';

    // The client reads the API key from the TOGETHER_API_KEY environment variable
    const together = new Together();

    const completion = await together.chat.completions.create({
      model: 'nvidia/nemotron-3-ultra-550b-a55b',
      messages: [
        {
          role: 'user',
          content: 'What are some fun things to do in New York?'
        }
      ],
    });

    console.log(completion.choices[0].message.content);
    
  • Model card

    Architecture Overview:
    • 550B total parameter MoE with 55B parameters activated per token
    • Hybrid Mamba-Transformer MoE architecture
    • Latent MoE: runs 4 experts at the inference cost of 1, improving model quality without added inference compute
    • Multi-Token Prediction (MTP): predicts multiple future tokens per forward pass, reducing generation time for long sequences
    • Token Budget: optimizes for accuracy with minimum reasoning token generation
    • 1M token context for sustained agent sessions and cross-document reasoning
    • NVFP4 precision optimized for Blackwell; FP8 and BF16 also supported
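The parameter counts and precisions listed above translate into rough weight footprints. The following is a minimal sketch of that arithmetic, not a sizing guide: it assumes nominal widths of 4, 8, and 16 bits per weight for NVFP4, FP8, and BF16, and it ignores scaling factors, any layers kept at higher precision, the KV cache, and activations. The function and constant names are illustrative only.

```python
# Back-of-the-envelope weight footprint for a 550B-parameter MoE model.
# Assumption: nominal bits per weight only; real checkpoints add scaling
# factors and may keep some layers in a higher precision.

TOTAL_PARAMS = 550e9   # total parameters (from the model card)
ACTIVE_PARAMS = 55e9   # parameters activated per token (from the model card)

# Nominal storage width per weight for each supported precision.
BITS_PER_WEIGHT = {"NVFP4": 4, "FP8": 8, "BF16": 16}


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Return the nominal weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


def active_fraction(total: float = TOTAL_PARAMS, active: float = ACTIVE_PARAMS) -> float:
    """Return the fraction of all parameters used for a single token."""
    return active / total


def summarize() -> list[str]:
    """Build one summary line per precision, plus the active fraction."""
    lines = [f"Active fraction per token: {active_fraction():.0%}"]
    for name, bits in BITS_PER_WEIGHT.items():
        size = weight_gigabytes(TOTAL_PARAMS, bits)
        lines.append(f"{name}: ~{size:,.0f} GB of weights")
    return lines


for line in summarize():
    print(line)
```

Under these assumptions, all 550B weights must be resident in memory even though only about one tenth of them are used for any given token, which is why the lower-precision formats matter for deployment.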

    Training Methodology:
    • Multi-environment RL training across agentic environments for reasoning, tool calling, and instruction following
    • Trained on NVIDIA-generated high-quality synthetic data from frontier open reasoning models
    • Open training recipes published for domain-specific customization

    Performance Characteristics:
    • Leading accuracy on the Artificial Analysis Intelligence Index among open models
    • Strong performance across reasoning, coding, and agentic task benchmarks
    • Token Budget support enables predictable inference cost on long-horizon tasks

  • Prompting

    Together AI API Access:
    • Access NVIDIA Nemotron 3 Ultra via Together AI APIs using the endpoint nvidia/nemotron-3-ultra-550b-a55b as the model value in each request
    • Authenticate by sending your Together AI API key as a Bearer token in the Authorization header
    • Supports tool calling, Token Budget for cost-controlled reasoning, and extended context up to 1M tokens
    • Available on serverless and dedicated infrastructure
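As an illustration of the tool calling support mentioned above, here is a minimal sketch that builds a chat completion request advertising one tool and reads back any tool calls from the reply. It assumes the OpenAI-style `tools` request field and `tool_calls` response field. The `run_tests` tool, the helper names, and the sample prompt are hypothetical and exist only for this example. It uses the Python standard library so the request shape is visible; the Together SDK shown earlier is the more convenient route in practice.

```python
import json
import os
import urllib.request

API_URL = "https://api.together.xyz/v1/chat/completions"  # same URL as the cURL example
MODEL = "nvidia/nemotron-3-ultra-550b-a55b"

# Hypothetical tool definition for illustration; not part of any real API.
RUN_TESTS_TOOL = {
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return failures.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Directory or file to test"}
            },
            "required": ["path"],
        },
    },
}


def build_payload(user_message: str, tools: list[dict]) -> dict:
    """Build an OpenAI-style chat completion body that advertises tools."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
    }


def send(payload: dict, api_key: str) -> dict:
    """POST the payload with a Bearer token and return the parsed JSON reply."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_tool_calls(reply: dict) -> list[dict]:
    """Return the tool calls requested by the model, or an empty list."""
    message = reply["choices"][0]["message"]
    return message.get("tool_calls") or []


def main() -> None:
    """Send one live request; requires TOGETHER_API_KEY to be set."""
    payload = build_payload("The tests in ./src are failing. Investigate.", [RUN_TESTS_TOOL])
    reply = send(payload, os.environ["TOGETHER_API_KEY"])
    for call in extract_tool_calls(reply):
        print(call["function"]["name"], call["function"]["arguments"])
```

Calling `main()` sends a single request. In an agent loop, the caller would execute each returned tool call itself and append the result to `messages` as a tool message before asking the model to continue.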

  • Applications & use cases

    Coding Agents:
    • Architectural planning and design decisions within week-long autonomous coding sessions
    • Complex multi-file refactors and end-to-end issue resolution across large codebases
    • Error recovery and iterative debugging within persistent agent loops

    Deep Research & Search:
    • Cross-referencing and synthesis across large source sets within sustained research agent loops
    • Contradiction resolution and novel hypothesis generation at the final synthesis stage
    • Long-context reasoning with 1M token window for extensive document sets

    Enterprise Agent Workflows:
    • Security alert triage, regulatory filing ingestion, and clinical trial orchestration within persistent tool-using loops
    • Complex reasoning steps within multi-step enterprise automation across industries

    EDA & Chip Design:
    • RTL generation from specifications and verification across thousands of constraints
    • Failure analysis and cross-block dependency resolution within chip design agent workflows
    • Design-to-manufacturing sign-off orchestration

Model details
  • Model provider
    NVIDIA
  • Type
    Reasoning
    Chat
    Code
  • Main use cases
    Reasoning
  • Features
    Function Calling
    JSON Mode
  • Deployment
    Serverless
    Monthly Reserved
  • Parameters
    550B
  • Activated parameters
    55B
  • Context length
    1M
  • Input price
    $0.60 / 1M tokens
    $0.20 / 1M tokens (cached)
  • Output price
    $3.60 / 1M tokens

  • Input modalities
    Text
  • Output modalities
    Text
  • Released
    May 30, 2026
  • Category
    Chat
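
The input, cached-input, and output prices listed above can be combined into a rough per-request estimate. The sketch below assumes that cached tokens are a subset of the input tokens and are billed at the cached rate instead of the standard input rate. That billing model is an assumption, not something stated in the listing, and the function name is illustrative.

```python
# Prices in USD per 1M tokens, copied from the listing.
INPUT_PRICE = 0.60
CACHED_INPUT_PRICE = 0.20
OUTPUT_PRICE = 3.60


def estimate_cost_usd(input_tokens: int, output_tokens: int, cached_input_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    cached_input_tokens is the part of input_tokens assumed to be billed
    at the cached rate rather than the standard input rate.
    """
    if not 0 <= cached_input_tokens <= input_tokens:
        raise ValueError("cached_input_tokens must be between 0 and input_tokens")
    fresh_input_tokens = input_tokens - cached_input_tokens
    total = (
        fresh_input_tokens * INPUT_PRICE
        + cached_input_tokens * CACHED_INPUT_PRICE
        + output_tokens * OUTPUT_PRICE
    )
    return total / 1_000_000


# Example: 200,000 input tokens (150,000 of them cached) and 4,000 output tokens.
print(f"${estimate_cost_usd(200_000, 4_000, cached_input_tokens=150_000):.4f}")
```

For the example values, the three terms are 50,000 × $0.60, 150,000 × $0.20, and 4,000 × $3.60 per million tokens, which sums to about $0.0744. On long agent sessions the output term tends to dominate, since output tokens cost six times as much as uncached input tokens.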