Qwen3-Max-Thinking

Qwen3-Max-Thinking is Alibaba Cloud's most powerful AI model — a 1+ trillion parameter Mixture-of-Experts system that tops multiple frontier benchmarks. Launched as Qwen3-Max in September 2025 and upgraded with test-time scaling (TTS) in January 2026, it scores a perfect 100% on AIME 2025, leads Humanity's Last Exam with search, and beats GPT-5.2 on math olympiad tasks. The model is API-only (proprietary), accessible through Alibaba Cloud's DashScope with OpenAI-compatible endpoints. For the complete Qwen 3 ecosystem, see the Qwen 3 overview.

What sets Qwen3-Max-Thinking apart from other frontier models isn't just raw scale — it's how it uses that scale at inference time. The test-time scaling mechanism doesn't rely on naive best-of-N sampling. Instead, it employs an experience-cumulative, multi-round reasoning strategy that progressively refines answers across multiple internal passes. The result: state-of-the-art scores on the hardest reasoning benchmarks available, including several where it outperforms both GPT-5.2 and Gemini 3 Pro.

Key Specifications

| Attribute | Detail |
|---|---|
| Developer | Alibaba Cloud — Qwen Team |
| Model Name | Qwen3-Max-Thinking (snapshot: qwen3-max-2026-01-23) |
| Parameters | 1+ trillion (Mixture-of-Experts) |
| Architecture | MoE with global-batch load-balancing loss |
| Context Window | 256,000 tokens (up to 1M referenced) |
| Max Output Tokens | 131,072 tokens |
| Languages | 100+ |
| Pre-training Data | 36 trillion tokens |
| Training Method | Two-stage: fine-tuning + reinforcement learning |
| Thinking Mode | Toggle via enable_thinking API parameter |
| Output Speed | ~38 tokens/second |
| License | Proprietary (API-only, closed source) |
| API Compatibility | OpenAI-compatible + Anthropic-compatible |
| Initial Release | September 5, 2025 (Qwen3-Max) |
| TTS Upgrade | January 27, 2026 (Qwen3-Max-Thinking) |

Test-Time Scaling: How It Works

Test-time scaling (TTS) is the key innovation that elevates Qwen3-Max from a strong model to a benchmark leader. Unlike standard inference where the model generates one answer, TTS allows the model to trade compute for intelligence at inference time.

The mechanism works through what Alibaba describes as an experience-cumulative, multi-round reasoning strategy: rather than sampling many independent answers and picking one, the model makes several internal passes over the problem, carries forward insights from earlier rounds, and progressively refines its answer before responding.

The practical effect is dramatic. On AIME 2025 (American Invitational Mathematics Examination), standard mode scores 81.6 — but with test-time scaling enabled, the score jumps to a perfect 100%. Similar gains appear across GPQA Diamond (+5.4 points), LiveCodeBench (+5.5), and HLE with search (+8.5).
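Alibaba has not published the exact algorithm, but the behavior described above can be sketched as a simple loop. The following is an illustrative sketch only: `multi_round_solve`, `attempt`, and `score` are hypothetical names, and the Newton-step demo merely stands in for "refine the previous round's answer".

```python
# Illustrative sketch (NOT Alibaba's published algorithm): an
# experience-cumulative, multi-round loop. Each round re-attempts the
# problem with notes carried over from earlier rounds, so the best
# candidate is refined rather than resampled from scratch (best-of-N).

def multi_round_solve(attempt, score, rounds=4):
    """attempt(notes) -> candidate; score(candidate) -> float (1.0 = verified)."""
    notes, best, best_score = [], None, float("-inf")
    for _ in range(rounds):
        candidate = attempt(notes)      # reason using accumulated experience
        s = score(candidate)
        notes.append((candidate, s))    # carry the insight into the next round
        if s > best_score:
            best, best_score = candidate, s
        if best_score >= 1.0:           # stop early once the answer verifies
            break
    return best

# Toy demo: each round refines the previous estimate of sqrt(2).
def attempt(notes):
    x = notes[-1][0] if notes else 1.0  # start from the latest candidate
    return 0.5 * (x + 2.0 / x)          # one refinement step (Newton's method)

def score(x):
    return 1.0 if abs(x * x - 2.0) < 1e-9 else -abs(x * x - 2.0)

print(round(multi_round_solve(attempt, score, rounds=8), 6))
```

The key contrast with naive best-of-N is the `notes` list: later rounds build on earlier ones instead of starting over.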

Dual Mode Operation

Qwen3-Max-Thinking operates in two modes, controlled by the enable_thinking API parameter:

- Standard mode (enable_thinking: false): fast, direct responses without extended reasoning
- Thinking mode (enable_thinking: true): the model reasons step by step before answering, enabling the test-time scaling gains below
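Assuming the OpenAI-compatible request format shown in the Quick Start below, switching modes amounts to one request field. A hedged sketch (the `build_request` helper is hypothetical; only the `enable_thinking` field is documented here):

```python
# Hypothetical helper: construct the request payload for either mode.
# Everything besides `enable_thinking` follows the standard
# OpenAI-compatible chat-completions format.

def build_request(prompt: str, thinking: bool) -> dict:
    return {
        "model": "qwen3-max-2026-01-23",
        "messages": [{"role": "user", "content": prompt}],
        # Passed via extra_body when using the OpenAI Python SDK.
        "extra_body": {"enable_thinking": thinking},
    }

fast = build_request("Summarize this paragraph.", thinking=False)
deep = build_request("Prove that sqrt(2) is irrational.", thinking=True)
print(fast["extra_body"], deep["extra_body"])
```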

Complete Benchmark Results

Standard Mode (Without Test-Time Scaling)

| Benchmark | Score | Category |
|---|---|---|
| GPQA Diamond | 87.4 | PhD-level Science |
| MMLU-Pro | 85.7 | Knowledge |
| MMLU-Redux | 92.8 | Knowledge |
| C-Eval | 93.7 | Chinese Knowledge |
| HLE | 30.2 | Extreme Difficulty |
| HLE (with search) | 49.8 | Agentic Reasoning |
| LiveCodeBench v6 | 85.9 | Coding |
| HMMT Feb 25 | 98.0 | Math Competition |
| HMMT Nov 25 | 94.7 | Math Competition |
| IMO-AnswerBench | 83.9 | Math Olympiad |
| SWE-Bench Verified | 75.3 | Software Engineering |
| Arena-Hard v2 | 90.2 | General Chat |
| IFBench | 70.9 | Instruction Following |
| BFCL-V4 | 67.7 | Function Calling |
| Tau2-Bench | 82.1 | Agent Tool Use |
| SuperGPQA | 65.1 | Graduate-level |

With Test-Time Scaling (Heavy Mode)

| Benchmark | Without TTS | With TTS | Gain |
|---|---|---|---|
| AIME 2025 | 81.6 | 100.0 | +18.4 |
| HMMT | 98.0 | 100.0 | +2.0 |
| GPQA Diamond | 87.4 | 92.8 | +5.4 |
| IMO-AnswerBench | 83.9 | 91.5 | +7.6 |
| LiveCodeBench v6 | 85.9 | 91.4 | +5.5 |
| HLE | 30.2 | 36.5 | +6.3 |
| HLE (with search) | 49.8 | 58.3 | +8.5 |

Test-time scaling consistently adds 2–18 points across all benchmarks tested.
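The gain column is just the difference of the paired scores; recomputing it is a quick sanity check on the table:

```python
# (without TTS, with TTS) pairs from the table above.
scores = {
    "AIME 2025": (81.6, 100.0),
    "HMMT": (98.0, 100.0),
    "GPQA Diamond": (87.4, 92.8),
    "IMO-AnswerBench": (83.9, 91.5),
    "LiveCodeBench v6": (85.9, 91.4),
    "HLE": (30.2, 36.5),
    "HLE (with search)": (49.8, 58.3),
}

# Gain = (with TTS) - (without TTS), rounded to one decimal place.
gains = {name: round(after - before, 1)
         for name, (before, after) in scores.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}")
```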

Head-to-Head: Qwen3-Max-Thinking vs Frontier Models

Science & Reasoning

| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| GPQA Diamond | 92.8 | 92.4 | 91.9 | 87.0 | 82.4 |
| HLE (no search) | 36.5 | 35.5 | 37.5 | 30.8 | 25.1 |
| HLE (with search) | 58.3 | 45.5 | 45.0 | 43.2 | 40.8 |

Mathematics

| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| IMO-AnswerBench | 91.5 | 86.3 | 83.3 | 84.0 | 78.3 |
| HMMT Feb 25 | 98.0 | 97.5 | 92.5 | – | – |
| AIME 2025 | 100.0 | – | – | – | – |

Coding & Software Engineering

| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| LiveCodeBench v6 | 91.4 | 87.7 | 90.7 | 84.8 | 80.8 |
| SWE-Bench Verified | 75.3 | 80.0 | 76.2 | 80.9 | 73.1 |

Agents & General Quality

| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| Arena-Hard v2 | 90.2 | 76.7 | – | – | – |
| Tau2-Bench | 82.1 | 80.9 | 85.4 | 85.7 | 80.3 |

Where Qwen3-Max Leads — and Where It Doesn't

Clear Advantages

- Math Olympiad dominance — #1 on IMO-AnswerBench (91.5), perfect AIME 2025, and near-perfect HMMT (98.0/100.0)
- HLE with search — 58.3 on Humanity's Last Exam when tools are available, 13 points ahead of GPT-5.2
- Competitive coding — 91.4 on LiveCodeBench v6, the highest among all models tested
- General chat quality — 90.2 on Arena-Hard v2 and ~1430 ELO on the LMArena leaderboard

Notable Gaps

- Real-world software engineering — 75.3 on SWE-Bench Verified trails GPT-5.2 (80.0) and Claude Opus 4.5 (80.9)
- HLE without search — 36.5 sits just below Gemini 3 Pro (37.5)
- Agent tool use — 82.1 on Tau2-Bench trails Gemini 3 Pro (85.4) and Claude Opus 4.5 (85.7)
- Output speed — ~38 tokens/second is slow for latency-sensitive applications

API Access

Qwen3-Max-Thinking is available through Alibaba Cloud's DashScope / Model Studio service with OpenAI-compatible API endpoints. It's also available through third-party platforms such as Novita AI (see Pricing below).

Quick Start (Python)

```python
from openai import OpenAI

# DashScope's OpenAI-compatible endpoint (international region).
client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="your-dashscope-api-key",
)

response = client.chat.completions.create(
    model="qwen3-max-2026-01-23",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    # enable_thinking toggles the deep-reasoning (test-time scaling) mode.
    extra_body={"enable_thinking": True},
)
print(response.choices[0].message.content)
```

Pricing

| Provider | Input | Output | Cache Read |
|---|---|---|---|
| DashScope (≤128K) | $1.20 / 1M tokens | $6.00 / 1M tokens | $0.24 / 1M |
| DashScope (>128K) | $3.00 / 1M tokens | $15.00 / 1M tokens | $0.60 / 1M |
| Novita AI | $0.50 / 1M tokens | $5.00 / 1M tokens | – |

Pricing as of January 2026. Third-party rates may vary.

For cost comparison: Qwen3-Max-Thinking is significantly cheaper than GPT-5 ($15/$60 per 1M input/output) and comparable to Claude Opus ($15/$75). The Novita AI option at $0.50 input makes it one of the most affordable frontier-class models available.
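Using the DashScope ≤128K rates quoted above, a per-request cost is straightforward to estimate. A hedged sketch (the `cost_usd` helper is hypothetical, and it assumes the cache-read rate applies to the cached portion of the input):

```python
# Rates from the pricing table: $1.20 / 1M input tokens,
# $6.00 / 1M output tokens, $0.24 / 1M cached input tokens.

def cost_usd(input_tokens, output_tokens, cached_tokens=0,
             in_rate=1.20, out_rate=6.00, cache_rate=0.24):
    fresh_input = input_tokens - cached_tokens   # assumed billing split
    return (fresh_input * in_rate
            + cached_tokens * cache_rate
            + output_tokens * out_rate) / 1_000_000

# A long reasoning call: 20K prompt tokens, 8K generated tokens.
print(f"${cost_usd(20_000, 8_000):.4f}")
```

The same call at GPT-5's quoted $15/$60 rates would cost roughly ten times more, which is the comparison the paragraph above is making.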

Best Use Cases

- Math & STEM Research — Perfect AIME scores and #1 IMO performance make it the top choice for mathematical reasoning and scientific problem-solving.
- Competitive Coding — 91.4 on LiveCodeBench v6 with test-time scaling. Ideal for algorithm contests and complex code generation.
- Deep Research — The 58.3 HLE-with-search score (far ahead of competitors) shows exceptional ability to combine reasoning with tool use for research tasks.
- Knowledge-Intensive QA — 92.8 MMLU-Redux and 93.7 C-Eval demonstrate broad knowledge across domains and languages.
- Complex Multi-Step Agents — Native tool calling + test-time scaling enable agents that reason deeply before acting.
- High-Quality Chat — 90.2 Arena-Hard v2 and ~1430 LMArena ELO for premium conversational applications.

Limitations

- Closed weights — API-only; no local or self-hosted deployment is possible
- Output speed — ~38 tokens/second is slow for latency-sensitive applications
- Long-context pricing — rates more than double above 128K input tokens
- Software engineering — trails GPT-5.2 and Claude Opus 4.5 on SWE-Bench Verified

Qwen3-Max Timeline

- September 5, 2025 — Qwen3-Max launched: 1T+ MoE, API-only, ~1430 LMArena ELO
- November 2025 — Thinking mode added to Qwen3-Max
- January 27, 2026 — Qwen3-Max-Thinking with test-time scaling: perfect AIME 2025, #1 HLE with search

Frequently Asked Questions

Is Qwen3-Max open source?

No. Qwen3-Max is the only proprietary model in the Qwen 3 family. It's available exclusively through API. All other Qwen 3 models (0.6B through 235B, Coder, ASR, TTS, etc.) are open-weight under Apache 2.0.

Can I run Qwen3-Max locally?

No — the model weights are not publicly available. For the most powerful self-hostable option, use Qwen3-235B-A22B-Thinking-2507, which is open-weight and achieves strong benchmark results.

What's the difference between Qwen3-Max and Qwen3-Max-Thinking?

They're the same model. Qwen3-Max refers to the base model (September 2025). Qwen3-Max-Thinking refers to the January 2026 upgrade that added test-time scaling for deeper reasoning. The API model ID qwen3-max-2026-01-23 includes both modes — toggle with enable_thinking.

How does the pricing compare to GPT-5 and Claude?

Qwen3-Max-Thinking at $1.20/$6.00 per million tokens (input/output) is significantly cheaper than GPT-5 and Claude Opus for comparable benchmark performance. Third-party providers like Novita AI offer even lower rates at $0.50/$5.00.

Is it good for coding?

Mixed. It leads LiveCodeBench v6 (91.4) for competitive coding, but trails on SWE-Bench Verified (75.3) for real-world software engineering. For dedicated coding tasks, Qwen3-Coder-Next is a better specialized choice.