Qwen3-Max-Thinking
Qwen3-Max-Thinking is Alibaba Cloud's most powerful AI model — a 1+ trillion parameter Mixture-of-Experts system that tops multiple frontier benchmarks. Launched as Qwen3-Max in September 2025 and upgraded with test-time scaling (TTS) in January 2026, it scores a perfect 100% on AIME 2025, leads Humanity's Last Exam with search, and beats GPT-5.2 on math olympiad tasks. The model is API-only (proprietary), accessible through Alibaba Cloud's DashScope with OpenAI-compatible endpoints. For the complete Qwen 3 ecosystem, see the Qwen 3 overview.
What sets Qwen3-Max-Thinking apart from other frontier models isn't just raw scale — it's how it uses that scale at inference time. The test-time scaling mechanism doesn't rely on naive best-of-N sampling. Instead, it employs an experience-cumulative, multi-round reasoning strategy that progressively refines answers across multiple internal passes. The result: state-of-the-art scores on the hardest reasoning benchmarks available, including several where it outperforms both GPT-5.2 and Gemini 3 Pro.
Key Specifications
| Specification | Details |
|---|---|
| Developer | Alibaba Cloud — Qwen Team |
| Model Name | Qwen3-Max-Thinking (snapshot: qwen3-max-2026-01-23) |
| Parameters | 1+ trillion (Mixture-of-Experts) |
| Architecture | MoE with global-batch load-balancing loss |
| Context Window | 256,000 tokens (a 1M window is referenced in some materials) |
| Max Output Tokens | 131,072 tokens |
| Languages | 100+ |
| Pre-training Data | 36 trillion tokens |
| Training Method | Two-stage: fine-tuning + reinforcement learning |
| Thinking Mode | Toggle via enable_thinking API parameter |
| Output Speed | ~38 tokens/second |
| License | Proprietary (API-only, closed source) |
| API Compatibility | OpenAI-compatible + Anthropic-compatible |
| Initial Release | September 5, 2025 (Qwen3-Max) |
| TTS Upgrade | January 27, 2026 (Qwen3-Max-Thinking) |
Test-Time Scaling: How It Works
Test-time scaling (TTS) is the key innovation that elevates Qwen3-Max from a strong model to a benchmark leader. Unlike standard inference where the model generates one answer, TTS allows the model to trade compute for intelligence at inference time.
The mechanism works through what Alibaba describes as an experience-cumulative, multi-round reasoning strategy:
- Multi-pass refinement: The model revisits and refines its reasoning across multiple internal rounds, building on previous attempts rather than starting from scratch each time.
- Adaptive tool integration: During reasoning, the model can invoke built-in tools — search, memory, and a code interpreter — to gather additional information or verify computations.
- Controllable latency: Developers can adjust the depth of reasoning to balance accuracy against response time. Lighter tasks skip deep reasoning; hard problems get the full treatment.
The practical effect is dramatic. On AIME 2025 (American Invitational Mathematics Examination), standard mode scores 81.6 — but with test-time scaling enabled, the score jumps to a perfect 100%. Similar gains appear across GPQA Diamond (+5.4 points), LiveCodeBench (+5.5), and HLE with search (+8.5).
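Alibaba has not published the internals of this loop, but the difference from best-of-N can be made concrete. Below is a minimal conceptual sketch in Python; the generate() and score() functions are placeholders for a model call and an answer verifier, not part of any real Qwen API.

```python
# Conceptual sketch only: generate() and score() are placeholders for a model
# call and an answer verifier; they are not part of any real Qwen API.

def generate(prompt: str) -> str:
    """Stand-in for one model call."""
    return f"attempt given {len(prompt)} chars of context"

def score(answer: str) -> float:
    """Stand-in for verification (a checker, unit tests, or self-critique)."""
    return (len(answer) % 7) / 7.0  # dummy heuristic

def best_of_n(question: str, n: int = 4) -> str:
    """Naive best-of-N: n independent samples, keep the best. No memory."""
    samples = [generate(question) for _ in range(n)]
    return max(samples, key=score)

def experience_cumulative(question: str, rounds: int = 4) -> str:
    """Each round sees all prior attempts and refines them instead of
    starting from scratch -- the strategy Alibaba describes."""
    history: list[str] = []
    for _ in range(rounds):
        context = question + "\n\nPrior attempts:\n" + "\n".join(history)
        history.append(generate(context))
    return max(history, key=score)

print(experience_cumulative("Prove that sqrt(2) is irrational."))
```

The key design difference: best-of-N throws away everything between samples, while the cumulative loop feeds earlier attempts back into the context, so later rounds can correct rather than merely retry.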
Dual Mode Operation
Qwen3-Max-Thinking operates in two modes, controlled by the enable_thinking API parameter:
- Thinking Mode (`enable_thinking: true`) — Activates chain-of-thought reasoning and test-time scaling. Lower temperature and top_p settings are recommended. Best for complex math, STEM, multi-hop reasoning, and agentic tasks.
- Non-Thinking Mode (`enable_thinking: false`) — Fast direct responses without chain-of-thought. Ideal for general chat, search, customer support, and latency-sensitive applications.
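A minimal sketch of toggling the two modes per request through the OpenAI-compatible endpoint. The sampling values are illustrative assumptions, not official recommendations.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="your-dashscope-api-key",
)

def ask(question: str, think: bool) -> str:
    response = client.chat.completions.create(
        model="qwen3-max-2026-01-23",
        messages=[{"role": "user", "content": question}],
        # Thinking mode benefits from a lower sampling temperature (see above);
        # the exact values here are illustrative.
        temperature=0.6 if think else 0.7,
        extra_body={"enable_thinking": think},
    )
    return response.choices[0].message.content

print(ask("Summarize Hamlet in one sentence.", think=False))   # fast path
print(ask("Prove that sqrt(2) is irrational.", think=True))    # deep path
```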
Complete Benchmark Results
Standard Mode (Without Test-Time Scaling)
| Benchmark | Score | Category |
|---|---|---|
| GPQA Diamond | 87.4 | PhD-level Science |
| MMLU-Pro | 85.7 | Knowledge |
| MMLU-Redux | 92.8 | Knowledge |
| C-Eval | 93.7 | Chinese Knowledge |
| HLE | 30.2 | Extreme Difficulty |
| HLE (with search) | 49.8 | Agentic Reasoning |
| LiveCodeBench v6 | 85.9 | Coding |
| HMMT Feb 25 | 98.0 | Math Competition |
| HMMT Nov 25 | 94.7 | Math Competition |
| IMO-AnswerBench | 83.9 | Math Olympiad |
| SWE-Bench Verified | 75.3 | Software Engineering |
| Arena-Hard v2 | 90.2 | General Chat |
| IFBench | 70.9 | Instruction Following |
| BFCL-V4 | 67.7 | Function Calling |
| Tau2-Bench | 82.1 | Agent Tool Use |
| SuperGPQA | 65.1 | Graduate-level |
With Test-Time Scaling (Heavy Mode)
| Benchmark | Without TTS | With TTS | Gain |
|---|---|---|---|
| AIME 2025 | 81.6 | 100.0 | +18.4 |
| HMMT Feb 25 | 98.0 | 100.0 | +2.0 |
| GPQA Diamond | 87.4 | 92.8 | +5.4 |
| IMO-AnswerBench | 83.9 | 91.5 | +7.6 |
| LiveCodeBench v6 | 85.9 | 91.4 | +5.5 |
| HLE | 30.2 | 36.5 | +6.3 |
| HLE (with search) | 49.8 | 58.3 | +8.5 |
Test-time scaling adds between 2 and 18.4 points on every benchmark tested.
Head-to-Head: Qwen3-Max-Thinking vs Frontier Models
Science & Reasoning
| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| GPQA Diamond | 92.8 | 92.4 | 91.9 | 87.0 | 82.4 |
| HLE (no search) | 36.5 | 35.5 | 37.5 | 30.8 | 25.1 |
| HLE (with search) | 58.3 | 45.5 | 45.0 | 43.2 | 40.8 |
Mathematics
| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| IMO-AnswerBench | 91.5 | 86.3 | 83.3 | 84.0 | 78.3 |
| HMMT Feb 25 | 100.0 | — | 97.5 | — | 92.5 |
| AIME 2025 | 100.0 | — | — | — | — |
Coding & Software Engineering
| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| LiveCodeBench v6 | 91.4 | 87.7 | 90.7 | 84.8 | 80.8 |
| SWE-Bench Verified | 75.3 | 80.0 | 76.2 | 80.9 | 73.1 |
Agents & General Quality
| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| Arena-Hard v2 | 90.2 | — | — | 76.7 | — |
| Tau2-Bench | 82.1 | 80.9 | 85.4 | 85.7 | 80.3 |
Where Qwen3-Max Leads — and Where It Doesn't
Clear Advantages
- HLE with search (58.3): about 13 points ahead of GPT-5.2 (45.5) and Gemini 3 Pro (45.0), the widest margin in any category.
- GPQA Diamond (92.8): edges out GPT-5.2 (92.4) and Gemini 3 Pro (91.9) on PhD-level science.
- Olympiad math: IMO-AnswerBench 91.5 versus 86.3 for GPT-5.2, plus perfect AIME 2025 and HMMT scores with TTS enabled.
- Competitive coding: LiveCodeBench v6 (91.4) tops GPT-5.2 (87.7) and Gemini 3 Pro (90.7).
- Price: frontier-class scores at $1.20/$6.00 per 1M tokens, a fraction of GPT-5 or Claude Opus rates.
Notable Gaps
- SWE-Bench Verified (75.3): Real-world software engineering trails Claude Opus 4.5 (80.9) and GPT-5.2 (80.0). Competitive coding skill doesn't fully translate to multi-file debugging.
- Tau2-Bench (82.1): Agent tool use falls behind Claude Opus 4.5 (85.7) and Gemini 3 Pro (85.4).
- Speed (~38 tokens/s): One of the slower frontier models. The test-time scaling mode adds further latency.
- HLE without search (36.5): Slightly behind Gemini 3 Pro (37.5) when tools aren't available.
API Access
Qwen3-Max-Thinking is available through Alibaba Cloud's DashScope / Model Studio service with OpenAI-compatible API endpoints. It's also available on third-party platforms:
- DashScope (Official) — Base URL: `https://dashscope-intl.aliyuncs.com/compatible-mode/v1`
- OpenRouter — Available as `qwen/qwen3-max`
- Novita AI — Lower pricing option
Quick Start (Python)
```python
from openai import OpenAI

# DashScope exposes an OpenAI-compatible endpoint, so the standard client works.
client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="your-dashscope-api-key",
)

response = client.chat.completions.create(
    model="qwen3-max-2026-01-23",  # dated snapshot of Qwen3-Max-Thinking
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"enable_thinking": True},  # turn on thinking mode / TTS
)

print(response.choices[0].message.content)
```
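The same code works against third-party routes; for OpenRouter only the base URL and model slug change. A sketch follows, using OpenRouter's standard endpoint and the slug listed in the API Access section above; verify both in OpenRouter's catalog before relying on them.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",   # OpenRouter's OpenAI-compatible endpoint
    api_key="your-openrouter-api-key",
)

response = client.chat.completions.create(
    model="qwen/qwen3-max",  # slug from the API Access section; confirm in the catalog
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
)
print(response.choices[0].message.content)
```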
Pricing
| Provider | Input | Output | Cache Read |
|---|---|---|---|
| DashScope (≤128K) | $1.20 / 1M tokens | $6.00 / 1M tokens | $0.24 / 1M |
| DashScope (>128K) | $3.00 / 1M tokens | $15.00 / 1M tokens | $0.60 / 1M |
| Novita AI | $0.50 / 1M tokens | $5.00 / 1M tokens | — |
Pricing as of January 2026. Third-party rates may vary.
For cost comparison: Qwen3-Max-Thinking is significantly cheaper than GPT-5 ($15/$60 per 1M input/output) and comparable to Claude Opus ($15/$75). The Novita AI option at $0.50 input makes it one of the most affordable frontier-class models available.
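As a quick sanity check on these numbers, here is a back-of-the-envelope estimate under the DashScope ≤128K tier, assuming reasoning tokens are billed as output (an assumption worth verifying against your invoice):

```python
# DashScope <=128K tier rates from the table above.
INPUT_RATE = 1.20 / 1_000_000   # USD per input token
OUTPUT_RATE = 6.00 / 1_000_000  # USD per output token

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request at the listed per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical reasoning-heavy request: 2K prompt tokens, 8K output
# (thinking tokens counted as output).
print(f"${request_cost(2_000, 8_000):.4f} per request")          # $0.0504
print(f"${request_cost(2_000, 8_000) * 10_000:,.2f} per 10K reqs")  # $504.00
```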
Best Use Cases
- Complex math and STEM reasoning: enable thinking mode and TTS for AIME/HMMT-level problems.
- Agentic research and search workflows: best-in-class HLE with search.
- Competitive coding and algorithmic problems: LiveCodeBench v6 leader.
- High-volume general chat, search, and customer support in non-thinking mode (Arena-Hard v2: 90.2).
- Multilingual applications across 100+ supported languages.
Limitations
- Closed source: Unlike the rest of the Qwen 3 family, Qwen3-Max is proprietary. You cannot download, inspect, fine-tune, or self-host the weights.
- Speed: At ~38 tokens/second, it's significantly slower than lighter models. Test-time scaling adds further latency for reasoning-heavy queries.
- Real-world coding: SWE-Bench Verified at 75.3 means it trails top competitors on multi-file debugging and real software engineering tasks. For dedicated coding, consider Qwen3-Coder-Next.
- Agent tool use: Tau2-Bench at 82.1 is strong but not best-in-class. Claude and Gemini currently handle complex tool chains more reliably.
- Vendor lock-in: API-only access means you're dependent on Alibaba Cloud's infrastructure, pricing decisions, and uptime.
- Text-only: Qwen3-Max-Thinking handles text only — no image, audio, or video inputs. For multimodal tasks, look at Qwen3-Omni or Qwen3-VL.
Qwen3-Max Timeline
| Date | Milestone |
|---|---|
| September 5, 2025 | Qwen3-Max launched — 1T+ MoE, API-only, ~1430 LMArena Elo |
| November 2025 | Thinking mode added to Qwen3-Max |
| January 27, 2026 | Qwen3-Max-Thinking with test-time scaling — perfect AIME 2025, #1 HLE with search |
Frequently Asked Questions
Is Qwen3-Max open source?
No. Qwen3-Max is the only proprietary model in the Qwen 3 family and is available exclusively via API. All other Qwen 3 models (0.6B through 235B, Coder, ASR, TTS, etc.) are open-weight under Apache 2.0.
Can I run Qwen3-Max locally?
No — the model weights are not publicly available. For the most powerful self-hostable option, use Qwen3-235B-A22B-Thinking-2507, which is open-source and achieves strong benchmark results.
What's the difference between Qwen3-Max and Qwen3-Max-Thinking?
They're the same model. Qwen3-Max refers to the base model (September 2025). Qwen3-Max-Thinking refers to the January 2026 upgrade that added test-time scaling for deeper reasoning. The API model ID qwen3-max-2026-01-23 includes both modes — toggle with enable_thinking.
How does the pricing compare to GPT-5 and Claude?
Qwen3-Max-Thinking at $1.20/$6.00 per million tokens (input/output) is significantly cheaper than GPT-5 and Claude Opus for comparable benchmark performance. Third-party providers like Novita AI offer even lower rates at $0.50/$5.00.
Is it good for coding?
Mixed. It leads LiveCodeBench v6 (91.4) for competitive coding, but trails on SWE-Bench Verified (75.3) for real-world software engineering. For dedicated coding tasks, Qwen3-Coder-Next is a better specialized choice.