NVIDIA Nemotron 3 Super 120B A12B

NVIDIA Nemotron 3 Super 120B A12B is NVIDIA's 120B total, 12B active-parameter hybrid Mamba-Transformer MoE built for complex multi-agent applications, featuring latent MoE and multi-token prediction.

Reasoning, Tool Use

index.ts

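import { streamText } from 'ai'

// Model slug as listed on this page
const result = streamText({
  model: 'nvidia/nemotron-3-super-120b-a12b',
  prompt: 'Why is the sky blue?'
})

// Print the response as it streams in
for await (const textPart of result.textStream) {
  process.stdout.write(textPart)
}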


Playground

Try out NVIDIA Nemotron 3 Super 120B A12B by NVIDIA. Usage is billed to your team at API rates. Free users (those who haven't made a payment) get $5 of credits every 30 days.


Providers

Route requests across multiple providers. Copy a provider slug to set your preference. Visit the docs for more info. Using a provider means you agree to their terms, listed under Legal.
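A provider preference might be expressed as in the sketch below. It assumes the gateway accepts an ordered list of provider slugs under `providerOptions.gateway.order`, and that `baseten` and `bedrock` are the slugs for the two providers listed on this page. Both are assumptions to check against the docs and the slug you copy here. `gatewayPreference` is a hypothetical helper, not part of any SDK.

```typescript
// Hypothetical helper: builds a provider-preference object for a gateway request.
// The `gateway.order` shape and the slugs are assumptions; confirm them in the docs.
type GatewayPreference = { gateway: { order: string[] } }

function gatewayPreference(...slugs: string[]): GatewayPreference {
  // Drop duplicates while keeping the first occurrence, so the order is unambiguous.
  const order = slugs.filter((slug, i) => slugs.indexOf(slug) === i)
  if (order.length === 0) {
    throw new Error('at least one provider slug is required')
  }
  return { gateway: { order } }
}

// Try Baseten first, then Amazon Bedrock (slugs assumed).
const providerOptions = gatewayPreference('baseten', 'bedrock')

// Intended use, assuming the `ai` package from the example at the top of this page:
//   streamText({ model: 'nvidia/nemotron-3-super-120b-a12b', prompt, providerOptions })
```

Keeping the preference in one helper means the order can be changed in a single place if latency or pricing shifts between providers.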

Provider	Context	Latency	Throughput	Input	Output	Cache	Release Date
Amazon Bedrock	256K	1.8s	179tps	$0.15/M	$0.65/M	(not listed)	03/11/2026
Baseten	256K	0.2s	(not listed)	$0.30/M	$0.75/M	Read: $0.06/M, Write: not listed	03/11/2026
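To compare the two providers on price, a rough per-request estimate can be computed from the input and output rates listed above. This is a minimal sketch: `estimateCostUSD` is a hypothetical helper, the workload size is an assumed example, and cache read discounts are ignored.

```typescript
// Prices in USD per million tokens, copied from the provider listing above.
type Price = { inputPerM: number; outputPerM: number }

const prices: Record<string, Price> = {
  'Amazon Bedrock': { inputPerM: 0.15, outputPerM: 0.65 },
  'Baseten': { inputPerM: 0.30, outputPerM: 0.75 },
}

// Hypothetical helper: cost of one request, ignoring cache discounts.
function estimateCostUSD(provider: string, inputTokens: number, outputTokens: number): number {
  const price = prices[provider]
  if (!price) throw new Error(`unknown provider: ${provider}`)
  return (inputTokens * price.inputPerM + outputTokens * price.outputPerM) / 1e6
}

// Example workload (assumed): 200,000 input tokens and 20,000 output tokens.
for (const provider of Object.keys(prices)) {
  console.log(provider, estimateCostUSD(provider, 200000, 20000).toFixed(4))
}
```

For this assumed workload the estimate comes to roughly $0.043 on Amazon Bedrock and $0.075 on Baseten, so price is worth weighing against the latency figures listed for each provider.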

More models by NVIDIA

Model	Context	Latency	Throughput	Input	Output	Cache	Release Date
nvidia/nemotron-3-ultra-550b-a55b	(not listed)	0.2s	286tps	$0.37/M	$1.08/M	Read: $0.12/M, Write: not listed	06/04/2026
nvidia/nemotron-3-nano-30b-a3b	262K	0.2s	76tps	$0.05/M	$0.24/M	(not listed)	12/15/2025
nvidia/nemotron-nano-12b-v2-vl	131K	0.2s	(not listed)	$0.20/M	$0.60/M	(not listed)	10/28/2025
nvidia/nemotron-nano-9b-v2	131K	0.2s	172tps	$0.06/M	$0.23/M	(not listed)	08/18/2025

About NVIDIA Nemotron 3 Super 120B A12B

NVIDIA released Nemotron 3 Super 120B A12B on March 11, 2026 as the second model in the Nemotron 3 family, following Nano. It has 120B total parameters, of which 12B are active per token. The hybrid Mamba-Transformer MoE backbone interleaves Mamba-2 layers for long-sequence processing, Transformer attention layers for precise recall, and MoE layers for compute efficiency. It delivers higher throughput than the previous Nemotron Super generation.

Two architectural innovations distinguish Super from Nano. First, latent MoE: before routing, token embeddings are compressed into a low-rank latent space, which lets the model consult 4x as many expert specialists at the same inference cost. Finer-grained routing allows distinct experts to activate for different subtasks (Python syntax, SQL logic, multi-hop reasoning) without paying the compute cost of running them all. Second, multi-token prediction (MTP): the model predicts multiple future tokens in a single forward pass. MTP strengthens reasoning during training and provides built-in speculative decoding at inference, yielding up to 3x speedups on structured generation tasks like code and tool calls.
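To see why predicting several tokens per pass can speed up decoding, the sketch below shows the acceptance step of greedy speculative decoding in generic form. It is a toy illustration and not NVIDIA's implementation: tokens are plain strings, and `acceptDraft`, `draft`, and `verified` are hypothetical names. Drafted tokens are kept while they agree with what the full model picks at each position. At the first disagreement the model's own token is used and the rest of the draft is discarded.

```typescript
// Toy acceptance step for greedy speculative decoding (illustrative only).
// draft:    tokens proposed ahead of time (e.g. by multi-token prediction heads)
// verified: tokens the full model picks at the same positions, plus one extra
//           token for the position after the draft (length = draft.length + 1)
function acceptDraft(draft: string[], verified: string[]): string[] {
  if (verified.length !== draft.length + 1) {
    throw new Error('verified must contain exactly one more token than draft')
  }
  const accepted: string[] = []
  for (let i = 0; i < draft.length; i++) {
    if (draft[i] !== verified[i]) {
      // First mismatch: keep the full model's token and stop.
      accepted.push(verified[i])
      return accepted
    }
    accepted.push(draft[i])
  }
  // Every drafted token matched, so the extra token is also kept.
  accepted.push(verified[draft.length])
  return accepted
}

// Structured output such as a tool call is predictable, so drafts often match.
const step = acceptDraft(['{', '"name"', ':'], ['{', '"name"', ':', '"get_weather"'])
console.log(step.length) // 4 tokens emitted from one verification pass
```

When drafts match often, as they tend to in code and tool-call output, each verification pass emits several tokens instead of one, which is where the speedup comes from. A mismatch costs nothing in correctness, since the output always follows the full model's choices.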

On PinchBench (a benchmark evaluating LLMs as the planning brain of an OpenClaw agent), NVIDIA Nemotron 3 Super 120B A12B scores 85.6%. Full announcement: https://docs.aws.amazon.com/en_us/bedrock/latest/userguide/model-card-nvidia-nemotron-super-3-120b.html.

What To Consider When Choosing a Provider

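Configuration: NVIDIA Nemotron 3 Super 120B A12B's multi-agent orientation means it works best as the planning and reasoning backbone in a pipeline where lighter models handle individual steps. Evaluate your task decomposition before choosing a tier, and compare per-token prices across providers: Amazon Bedrock lists $0.15 per million input tokens and $0.65 per million output tokens, while Baseten lists $0.30 and $0.75.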
Zero Data Retention: AI Gateway supports Zero Data Retention for this model via direct gateway requests (BYOK is not included). To configure this, check the documentation.
Authentication: AI Gateway authenticates requests using an API key or OIDC token. You do not need to manage provider credentials directly.

When to Use NVIDIA Nemotron 3 Super 120B A12B

Best For

Complex multi-agent applications: Software development pipelines or cybersecurity triaging that require deep planning across long contexts
Context explosion workloads: Multi-agent systems with up to 15x the token volume of standard chats that cause goal drift with smaller models
Dense technical problem-solving: Tasks where higher parameter count provides reasoning headroom
Super plus nano pattern: Agentic pipelines pairing Super for complex decisions with Nano for efficient individual steps
Fully open model requirement: Teams that need weights and recipes for enterprise customization, data control, or reproducibility
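The Super plus Nano pattern can be sketched as a small dispatcher that picks a model per pipeline step. This is an illustration under stated assumptions: the `Step` type, the `pickModel` helper, and the 100,000-token threshold are hypothetical, and only the two model slugs come from this page.

```typescript
// Model slugs as listed on this page.
const SUPER = 'nvidia/nemotron-3-super-120b-a12b'
const NANO = 'nvidia/nemotron-3-nano-30b-a3b'

// Hypothetical step description; the fields and threshold are illustrative.
type Step = {
  kind: 'plan' | 'execute'
  contextTokens: number
}

// Send planning and very long-context steps to Super, everything else to Nano.
function pickModel(step: Step, longContextThreshold = 100000): string {
  if (step.kind === 'plan' || step.contextTokens > longContextThreshold) {
    return SUPER
  }
  return NANO
}

const pipeline: Step[] = [
  { kind: 'plan', contextTokens: 40000 },
  { kind: 'execute', contextTokens: 2000 },
  { kind: 'execute', contextTokens: 180000 },
]
// An arrow function is used so that map's index is not passed as the threshold.
console.log(pipeline.map((step) => pickModel(step)))
```

In this example the planning step and the long-context step go to Super, while the short execution step goes to Nano. The right split depends on how your tasks decompose, so the threshold should be tuned against your own workload.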

Consider Alternatives When

Simpler task steps: Nemotron 3 Nano is more throughput-efficient for lighter workloads
Vision-language inputs: Super is text-only; Nemotron Nano 12B v2 VL supports multimodal inputs
Cost-first constraints: A lighter model may deliver acceptable quality at lower cost per token

Conclusion

NVIDIA Nemotron 3 Super 120B A12B combines latent MoE for expert specialization and multi-token prediction for inference speedups. Route requests through AI Gateway as the planning and reasoning backbone for complex multi-agent applications at scale.
