NVIDIA Nemotron 3 Super Is Now Inside Perplexity. Here Is What That Means for Search.

Q: What is NVIDIA Nemotron 3 Super?

Nemotron 3 Super is a 120-billion-parameter AI model released by NVIDIA in March 2026. It delivers 5x higher throughput than its predecessor and generates tokens over 50% faster than comparable models at equivalent hardware costs.

Q: Why did Perplexity integrate Nemotron 3 Super?

Perplexity integrated Nemotron 3 Super because it generates 429.6 tokens per second, compared to a median of 76.6 for similar open-weight models, which makes it significantly faster and cheaper to run at scale across Perplexity's multi-model orchestration system.

Q: Does Nemotron 3 Super affect AEO strategy?

Indirectly, yes. Faster, cheaper inference means Perplexity can run more complex multi-step reasoning at scale, which raises the bar for content quality. Well-structured, factually precise content with clear answer capsules will have an increasing advantage as the models evaluating it become more capable.

March 15, 2026

NVIDIA Nemotron 3 Super inside Perplexity AI search

NVIDIA Nemotron 3 Super is a 120-billion-parameter open AI model that Perplexity integrated across its search bar, Agent API, and Perplexity Computer in March 2026. The model delivers 5x higher throughput than its predecessor, generates over 429 tokens per second, and uses a hybrid Mamba-Transformer mixture-of-experts architecture with only 12 billion active parameters. Perplexity is also a founding member of the NVIDIA Nemotron Coalition, an open-model alliance that includes Mistral AI. For AEO practitioners, the practical takeaway is that Perplexity now routes complex reasoning steps to a model that is significantly faster and cheaper per query, which raises the bar for what content gets cited inside AI search.

NVIDIA released Nemotron 3 Super on March 11, 2026, a 120-billion-parameter open AI model that delivers 5x higher throughput than its predecessor and generates tokens over 50% faster than comparable models. Perplexity integrated it the same day. For anyone serious about AEO and AI search visibility, this launch is worth understanding, because the infrastructure powering search just got significantly faster and cheaper to run at scale.

There is a question that does not get asked enough when a major AI model launches: who is using it, and what does it change about how search actually works? With Nemotron 3 Super and Perplexity, the answer to both is unusually clear. Perplexity, one of the most important AI search platforms for AEO practitioners right now, is running Nemotron 3 Super as one of 20 orchestrated models inside its Computer product, in its core search interface, and through its Agent API for developers. That is not a future roadmap item. It is live.

What Is Nemotron 3 Super and Why Did Perplexity Integrate It?

Quick answer: Nemotron 3 Super is NVIDIA’s 120-billion-parameter open reasoning model designed for agentic AI. Perplexity integrated it across search, Agent API, and Perplexity Computer because it delivers 5x higher throughput than the previous Nemotron Super and tops the Artificial Analysis leaderboard for efficiency and openness among same-size models, which lowers Perplexity’s cost per query while improving answer speed.

Nemotron 3 Super is part of NVIDIA’s Nemotron 3 family of open-weight models, following the smaller Nemotron 3 Nano released in December 2025. Where Nano was tuned for smaller targeted tasks, Super is positioned to run complex agentic AI systems at scale. It has 120 billion total parameters but only 12 billion active at any given moment, thanks to a hybrid mixture-of-experts architecture that lets it match the capability of much larger dense models at a fraction of the inference cost.

Perplexity’s integration was not a passive partnership. The company is a founding member of the NVIDIA Nemotron Coalition, an alliance of AI labs collaborating on open base models, with its first project a co-development with Mistral AI. Perplexity contributes data, evaluations, and domain expertise from frontier model development, and in return it gets first-mover access to the most efficient open reasoning models on the market. That is the context for why Nemotron 3 Super was inside Perplexity’s search pipeline the day NVIDIA released it.

What Is Throughput and Why Does It Matter for AI Search?

Throughput, in the context of AI models, means how much output a model can generate in a given amount of time. Measured in tokens per second, it is one of the most important efficiency metrics for any AI system running at production scale.

Higher throughput means two things simultaneously: faster answers for users, and lower cost per query for the platforms serving those answers. When a model processes more tokens per second on the same hardware, the cost of each individual token drops. At the scale Perplexity operates, handling millions of queries every day across multiple AI models running in parallel, that efficiency difference compounds into a significant operational advantage.
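To make that relationship concrete, here is a minimal arithmetic sketch. The two throughput figures are the ones quoted in this article; the hourly hardware cost is a made-up placeholder, not a Perplexity or NVIDIA number, and the calculation simplifies by treating one generation stream as occupying the hardware on its own.

```python
# Illustrative only: HOURLY_HARDWARE_COST_USD is an assumed placeholder,
# not a real Perplexity or NVIDIA figure.
HOURLY_HARDWARE_COST_USD = 10.0


def cost_per_million_tokens(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Cost of generating one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Throughput figures quoted in this article (tokens per second).
fast = cost_per_million_tokens(429.6, HOURLY_HARDWARE_COST_USD)
median = cost_per_million_tokens(76.6, HOURLY_HARDWARE_COST_USD)

print(f"429.6 tok/s: ${fast:.2f} per million tokens")
print(f" 76.6 tok/s: ${median:.2f} per million tokens")
print(f"cost ratio:  {median / fast:.1f}x")
```

Whatever hourly figure you plug in, the ratio between the two costs stays at about 5.6x, because it depends only on the two throughput numbers.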

Nemotron 3 Super generates 429.6 tokens per second based on independent benchmarking, compared to a median of 76.6 tokens per second for open-weight models of similar size. Against specific competitors, it achieves 2.2x higher throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B on equivalent hardware, while delivering comparable or better accuracy across key reasoning benchmarks. That is not a marginal improvement. It is a fundamentally different performance tier.

Metric	Nemotron 3 Super	Category median or baseline
Total parameters	120 billion	100–130 billion
Active parameters	12 billion	All (dense models)
Tokens per second	429.6	76.6
Context window	1,000,000 tokens	128,000–200,000
Throughput vs predecessor	5x higher	Baseline
Accuracy vs predecessor	2x higher	Baseline

What Two Problems Does Nemotron 3 Super Solve for Perplexity?

To understand why this model matters for AI search specifically, you need to understand the two constraints that make agentic AI expensive and slow at scale. Both directly affect how Perplexity routes queries through its multi-model pipeline.

The Thinking Tax

Every time an agentic system needs to reason through a complex subtask, it uses a large, capable model to do it. But running a frontier-scale reasoning model for every single step in a multi-agent workflow is prohibitively expensive and slow. NVIDIA calls this the thinking tax, and it is the core reason why agentic AI has been difficult to deploy at production scale.

Nemotron 3 Super addresses this through its hybrid mixture-of-experts architecture, where only 12 billion of its 120 billion parameters are active at any given moment. The model uses latent MoE, which calls 4x as many expert specialists for the same inference cost by compressing tokens before they reach the experts. The result is reasoning capability comparable to much larger models at a fraction of the cost per query. For Perplexity, that means the model is economically viable to deploy on reasoning-heavy steps where slower or denser models would blow out the budget.
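The general idea behind mixture-of-experts routing can be sketched in a few lines. This is a generic top-k router, not NVIDIA's latent MoE, and it does not model the token-compression step described above. The expert count, the top-k value, and the random router scores are all hypothetical, chosen so that one active expert out of ten mirrors the 12-billion-of-120-billion ratio.

```python
import math
import random

NUM_EXPERTS = 10  # hypothetical; 1 of 10 mirrors the 12B-of-120B ratio
TOP_K = 1         # hypothetical number of experts run per token


def softmax(scores):
    """Turn raw router scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


random.seed(0)
# Stand-in for the scores a learned router would assign to one token.
token_scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
selected = route(token_scores)
active_fraction = len(selected) / NUM_EXPERTS

print("experts used for this token:", selected)
print("fraction of experts active:", active_fraction)
```

Only the selected experts do any work for a given token, which is why the compute per token tracks the active parameter count rather than the total.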

Context Explosion

Multi-agent workflows generate up to 15x more tokens than standard chat interactions because each step in the workflow requires resending full conversation histories, tool outputs, and intermediate reasoning to the next model in the chain. Over long tasks, this ballooning context causes goal drift, where the agent gradually loses alignment with the original objective and starts producing irrelevant or inconsistent outputs.
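A rough way to see why resending history inflates token counts is to add it up. The sketch below assumes every step produces the same number of new tokens and that each step receives everything produced so far. The step count and tokens per step are hypothetical, and the result is not meant to reproduce the 15x figure.

```python
def total_tokens_sent(steps: int, tokens_per_step: int, resend_history: bool) -> int:
    """Total tokens passed to models over a workflow of `steps` steps."""
    total = 0
    history = 0
    for _ in range(steps):
        history += tokens_per_step
        # With resending, each step receives the whole history so far.
        total += history if resend_history else tokens_per_step
    return total


STEPS = 20             # hypothetical workflow length
TOKENS_PER_STEP = 500  # hypothetical new tokens produced per step

with_history = total_tokens_sent(STEPS, TOKENS_PER_STEP, resend_history=True)
without_history = total_tokens_sent(STEPS, TOKENS_PER_STEP, resend_history=False)

print("resending full history:", with_history)
print("new tokens only:       ", without_history)
print("multiplier:            ", with_history / without_history)
```

With equal-sized steps the multiplier works out to (steps + 1) / 2, so it keeps growing as the workflow gets longer.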

Nemotron 3 Super’s 1-million-token context window solves this by allowing the entire workflow to stay in memory at once, without truncation or loss of context. The architecture interleaves Mamba-2 state space layers, which give linear-time complexity over long sequences, with transformer attention layers that handle precise associative recall. That hybrid is what makes a 1-million-token window practical at inference rather than theoretical on a benchmark sheet.
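The difference between linear-time and quadratic-time layers is easiest to appreciate with a quick count. This is a deliberately crude sketch: it ignores constant factors, hidden dimensions, and implementation optimizations, and simply compares the number of token pairs full self-attention scores against the number of sequential state updates a state space layer performs. The function names are illustrative, not taken from any library.

```python
def attention_pair_count(seq_len: int) -> int:
    """Token pairs scored by full self-attention: grows with the square of length."""
    return seq_len * seq_len


def state_space_step_count(seq_len: int) -> int:
    """State updates in a state space layer: one per token."""
    return seq_len


for n in (1_000, 128_000, 1_000_000):
    ratio = attention_pair_count(n) // state_space_step_count(n)
    print(f"{n:>9,} tokens: attention does roughly {ratio:,}x the work")
```

The gap widens in step with the sequence length, which is the rough intuition for interleaving linear-time layers with a smaller number of attention layers at long context.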

How Does Higher Throughput Actually Affect Perplexity Users?

The simplest way to think about throughput is this: it is how fast a model can do work, which in turn determines how much that work costs on a given set of hardware.

Think of two toll booths on a highway. One processes 100 cars per hour. The other processes 500 cars per hour. Same road, same destination, but one is dramatically more efficient. The faster booth handles more traffic, costs less per car to operate, and keeps the highway moving. That is what higher throughput means for an AI model running inside a search engine.

For Perplexity specifically, running a model that generates tokens over 50% faster than comparable alternatives means users get answers quicker, more queries can run simultaneously on the same infrastructure, and the cost per search drops. Nemotron 3 Super also currently holds the top spot on the Artificial Analysis leaderboard for efficiency and openness among same-size models, and it powers the NVIDIA AI-Q research agent to the number one position on DeepResearch Bench and DeepResearch Bench II, two benchmarks that measure multi-step research quality across large document sets. That benchmark dominance is exactly the profile Perplexity needs for its deep research and Comet browsing workflows.

How Does Perplexity Use Nemotron 3 Super Across Its Products?

Quick answer: Perplexity uses Nemotron 3 Super in three places: as a selectable model in the main Perplexity search bar, through its Agent API for developers building on Perplexity’s reasoning stack, and as one of 20 orchestrated models inside Perplexity Computer. It is available to Pro subscribers.

Perplexity is not a single AI model. It is an orchestration system that routes different parts of a search task to whichever model is best suited for that specific job. At any given moment it is coordinating up to 20 different models, including Claude Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, and now Nemotron 3 Super, depending on what each step of the query requires. As Perplexity itself describes its approach, the company post-trains different open models for each stage of answering a question, from query parsing and retrieval to reranking and drafting the final response, which lets it precisely tune latency, cost, and relevance.
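Stage-based routing of this kind can be pictured as a lookup from pipeline stage to model. The sketch below is not Perplexity's implementation. The stage names come from the description above, while the model labels and the choice of which stage counts as reasoning-heavy are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    needs_deep_reasoning: bool  # assumption made for illustration only


# Hypothetical model labels, not real product or API identifiers.
ROUTES = {
    False: "small-fast-model",
    True: "reasoning-model",
}

# Stage names follow the article; the reasoning flags are assumed.
PIPELINE = [
    Stage("query parsing", False),
    Stage("retrieval", False),
    Stage("reranking", False),
    Stage("drafting", True),
]


def plan(pipeline):
    """Pair each stage with the model label chosen for it."""
    return [(stage.name, ROUTES[stage.needs_deep_reasoning]) for stage in pipeline]


for stage_name, model in plan(PIPELINE):
    print(f"{stage_name:>14} -> {model}")
```

A real orchestrator would weigh latency, cost, and relevance per stage rather than a single flag, but the shape is the same: each step goes to the model judged best for it.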

Adding a model that handles complex multi-step reasoning at 5x the throughput of previous options is not a minor upgrade. It means Perplexity can route more of its demanding reasoning tasks to a model that costs less per token and returns results faster, without sacrificing the quality of the answer. That directly improves both the user experience and the economics of running the platform.

The broader implication is that no single AI company owns search anymore. The answer your audience sees on Perplexity is assembled from the best available model for each step of their query, in real time. Nemotron 3 Super made itself the most efficient option for the reasoning-heavy steps of that process the moment it shipped.

Who Else Is Using Nemotron 3 Super Besides Perplexity?

Perplexity got the headlines, but the broader adoption picture matters for understanding where Nemotron 3 Super sits in the AI search infrastructure stack. Coding agent platforms like CodeRabbit, Factory, and Greptile are integrating the model into their AI agents alongside proprietary models to achieve higher accuracy at lower cost. Life sciences and frontier AI organizations including Edison Scientific and Lila Sciences are using it to power agents for deep literature search, data science, and molecular understanding.

On the enterprise side, Amdocs, Palantir, Cadence, Dassault Systèmes, and Siemens are deploying and customizing the model to automate workflows in telecom, cybersecurity, semiconductor design, and manufacturing. Distribution-wise, Nemotron 3 Super is accessible at build.nvidia.com, OpenRouter, Hugging Face, and Perplexity, with cloud availability through Google Cloud Vertex AI, Oracle Cloud Infrastructure, AWS Bedrock, and Microsoft Azure. Dell and HPE are also bringing the model to their respective enterprise AI hubs.

The pattern is clear: this is not a model that lives in one product. It is becoming default infrastructure for agentic systems across consumer search, developer tools, and enterprise workflows, which is exactly why AEO practitioners need to pay attention to it.

What Does Nemotron 3 Super Mean for AEO and AI Search Visibility?

As the infrastructure powering AI search gets faster and more efficient, the volume and complexity of queries these systems can handle scales up dramatically. That is good for users. For brands and content creators, it raises a question: is your content built to be retrieved and cited inside these more capable, more demanding systems?

Faster, more efficient models do not lower the bar for what gets cited. They raise it. When a model can process more context, reason more deeply, and cross-reference more sources in the same amount of time, the competitive pressure on content quality and structure increases. Shallow content that answers one question adequately gets passed over in favor of content that demonstrates genuine topical depth, clear structure, and consistent authority signals across multiple sources.

The practical priorities have not changed. Direct answers at the top of every major section. FAQ structure with full questions as H2 headers. Content clusters that build topical authority across interconnected articles. Off-site mentions and earned media that give AI systems independent signals of credibility. What has changed is the urgency. The systems evaluating your content are getting more capable every quarter.

For a full breakdown of how to build content that performs inside these systems, the Prompt Insider AEO resource hub covers structure, authority signals, and measurement frameworks in detail. And if you want to understand the agentic AI architecture powering these search engines at a deeper level, our breakdown of what agentic AI is and how it works explains the full multi-model orchestration model in plain language.

NVIDIA building the most efficient open reasoning model on the market and Perplexity shipping it on day one is not a coincidence. It is a signal about where the infrastructure for AI search is heading. The platforms are getting faster, more capable, and more demanding. The brands showing up consistently in their answers are the ones building for that now.

Frequently Asked Questions (FAQs)

What is NVIDIA Nemotron 3 Super?

NVIDIA Nemotron 3 Super is a 120-billion-parameter open AI model released on March 11, 2026, designed specifically for agentic AI systems that run complex multi-step tasks. It has only 12 billion parameters active at any moment thanks to its hybrid Mamba-Transformer mixture-of-experts architecture, making it dramatically more efficient than dense models of comparable capability. It delivers 5x higher throughput than its predecessor and generates tokens over 50% faster than similar models, with a 1-million-token context window that allows entire agent workflows to stay in memory without losing track of the original goal.

Is Nemotron 3 Super available in Perplexity?

Yes. Nemotron 3 Super is available across three Perplexity surfaces: the main Perplexity search bar as a selectable model, the Perplexity Agent API for developers, and Perplexity Computer as one of 20 orchestrated models. Access is available with a Perplexity Pro subscription. The integration went live in March 2026, the same week NVIDIA released the model, because Perplexity is a founding member of the NVIDIA Nemotron Coalition.

Why did Perplexity integrate Nemotron 3 Super?

Perplexity runs an agentic search system that orchestrates up to 20 different AI models simultaneously, routing each step of a complex query to the most capable and efficient model for that specific job. Nemotron 3 Super delivers the highest throughput among open models of its size while maintaining strong reasoning accuracy, making it the most cost-efficient option for the demanding multi-step reasoning tasks inside Perplexity’s search pipeline. At Perplexity’s scale, faster token generation and lower cost per query translate directly to better user experience and lower operating costs.

What is the Nemotron Coalition and is Perplexity part of it?

The NVIDIA Nemotron Coalition is an alliance of AI labs collaborating on open base models, with all outputs released open source. Perplexity is a founding member alongside Mistral AI, which co-developed the coalition’s first base model with NVIDIA. Coalition members contribute data, evaluations, and domain expertise from frontier model development. The coalition matters for AI search because it formalizes a path for open models to keep pace with closed frontier models, which directly affects which models AI search engines can run at scale.

What does throughput mean for AI models?

Throughput in AI models refers to how many tokens a model can generate per second. Higher throughput means the model produces output faster and costs less per token to run on the same hardware. For a search platform handling millions of queries daily across multiple models running in parallel, throughput is one of the most important operational metrics. A model that generates 429.6 tokens per second versus a category median of 76.6 tokens per second is not just faster. It is economically viable to deploy at a scale where a slower model would be prohibitively expensive.

How does Nemotron 3 Super compare to GPT-OSS-120B and Qwen3.5?

On equivalent hardware, Nemotron 3 Super achieves 2.2x higher throughput than GPT-OSS-120B and 7.5x higher throughput than Qwen3.5-122B, while delivering comparable or better accuracy on key reasoning benchmarks. It also currently holds the number one position on the Artificial Analysis leaderboard for efficiency and openness among same-size models, and it powers the NVIDIA AI-Q research agent to the top spot on DeepResearch Bench and DeepResearch Bench II.

How does Nemotron 3 Super affect AEO strategy?

Faster, more efficient AI models enable search platforms to handle more complex queries, process deeper context, and cross-reference more sources in the same amount of time. That raises the bar for what content gets cited rather than lowering it. Brands optimizing for answer engine optimization need content that holds up under deeper scrutiny: structured for extraction, built around topical depth, and supported by consistent off-site authority signals. The fundamentals of AEO do not change as the underlying models get more capable. The margin for content that barely clears the bar gets smaller.

Kai Williams

Kai Williams has been in marketing for years, with a long background in SEO before AEO had a name. He stepped into Answer Engine Optimization the moment AI started reshaping how people search, and has been tracking the shift ever since. At Prompt Insider, he covers AEO, AI marketing, and the future of search, breaking down what is actually changing and what brands need to do about it.