Qwen 3: The Complete Guide

🚀 Looking for the latest? Qwen 3.5 (February 2026) is now Alibaba's newest flagship: a 397B MoE unified vision-language model with 17B active parameters that surpasses Qwen 3 on most benchmarks. Read the full Qwen 3.5 guide →

Qwen 3 is Alibaba Cloud's most ambitious open-source AI initiative to date. Launched in April 2025 and continuously expanded ever since, the Qwen 3 family now spans text LLMs, coding agents, vision-language models, speech recognition, voice synthesis, embeddings, and a trillion-parameter cloud flagship. Every open-weight variant ships under the permissive Apache 2.0 license, is trained on 36 trillion tokens across 119 languages, and competes head-to-head with GPT-5, Gemini 3, and Claude Opus. Explore the full lineup on the Qwen AI homepage.

From a 0.6B-parameter edge model that fits on a Raspberry Pi to the 1-trillion-parameter Qwen3-Max-Thinking that tops the HLE leaderboard, the Qwen 3 ecosystem covers every deployment scenario. Stand-out features include a Hybrid Reasoning Engine, sparsely-activated MoE architectures, context windows up to 1 million tokens, and native tool calling. Below you'll find the complete model catalog, benchmarks, local deployment guides, and API pricing.

Alibaba Cloud Qwen 3 model family overview: dense and MoE variants

The Qwen 3 Ecosystem

What started as eight text models in April 2025 has evolved into a full-stack AI platform. Here's every sub-family at a glance:

| Sub-family | Purpose | Params | License | Released |
| --- | --- | --- | --- | --- |
| Qwen3 (Base LLMs) | General text: chat, reasoning, agents | 0.6B–235B | Apache 2.0 | Apr 2025 |
| Qwen3-2507 | Updated Instruct & Thinking splits | 4B, 30B, 235B | Apache 2.0 | Jul 2025 |
| Qwen3-Max / Max-Thinking | Closed-source flagship with test-time scaling | 1T+ (MoE) | Proprietary | Sep 2025 / Jan 2026 |
| Qwen3-Coder / Coder-Next | Agentic coding with tool calling | 30B–480B | Apache 2.0 | Jul 2025 / Feb 2026 |
| Qwen3-VL | Vision-language understanding | 2B–32B | Apache 2.0 | 2025 |
| Qwen3-Omni | Multimodal: text + image + audio + video | – | Apache 2.0 | Sep 2025 |
| Qwen3-ASR | Speech recognition (52 languages) | 0.6B / 8B | Apache 2.0 | 2025 |
| Qwen3-TTS | Text-to-speech & voice cloning | 0.6B | Apache 2.0 | 2025 |
| Qwen3-Embedding | Text embeddings & reranking | 0.6B / 8B | Apache 2.0 | Jun 2025 |
| Qwen3-Next | Next-gen ultra-efficient architecture | – | Apache 2.0 | Sep 2025 |

This page focuses on the core text LLMs (base models, 2507 update, and Qwen3-Max). For specialized models, follow the links above to their dedicated guides.

Base LLM Line-up: Dense and MoE

The original April 2025 launch introduced six dense models (0.6B to 32B) and two Mixture-of-Experts variants (30B-A3B and 235B-A22B). All share the same tokenizer, instruction format, and hybrid thinking/non-thinking capability, meaning you can swap model sizes without rewriting your code.

Dense Models (0.6B–32B)

Ideal for chatbots, real-time RAG pipelines, and edge deployment. The 8B variant runs comfortably under 10 GB VRAM with GGUF quantization while matching much larger competitors on multilingual benchmarks.

Mixture-of-Experts (MoE) Variants

Specification Matrix (April 2025 Launch)

| Model | Type | Total Params | Active Params | Native Ctx | Extended Ctx |
| --- | --- | --- | --- | --- | --- |
| Qwen3-0.6B | Dense | 0.6B | – | 32K | – |
| Qwen3-1.7B | Dense | 1.7B | – | 32K | – |
| Qwen3-4B | Dense | 4B | – | 32K | 128K (YaRN) |
| Qwen3-8B | Dense | 8.2B | – | 32K | 128K (YaRN) |
| Qwen3-14B | Dense | 14B | – | 32K | 128K (YaRN) |
| Qwen3-32B | Dense | 32.8B | – | 32K | 128K (YaRN) |
| Qwen3-30B-A3B | MoE | 30.5B | ~3.3B | 32K | 128K (YaRN) |
| Qwen3-235B-A22B | MoE | 235B | ~22B | 32K | 128K (YaRN) |

April 2025 launch specifications. See the 2507 update below for revised context lengths and dedicated variants.
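The 128K figures above come from YaRN RoPE scaling rather than native training length. As a rough illustration of how that extension is typically enabled with Hugging Face Transformers (the exact rope_scaling key names and factor can vary between library versions, so treat this as a sketch rather than the official recipe):

from transformers import AutoConfig, AutoModelForCausalLM

# Sketch: extend Qwen3-8B from its native 32K context to ~128K via YaRN RoPE scaling.
# Key names and the scaling factor are assumptions based on common Transformers usage.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                               # 32K x 4 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", config=config, torch_dtype="auto", device_map="auto"
)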

The Qwen3-2507 Update (July 2025)

Three months after the initial launch, the Qwen team released a major revision that changed the reasoning architecture philosophy. Instead of a single hybrid model that switches between thinking and non-thinking modes, the 2507 update introduced dedicated variants:

2507 Variant Matrix

| Model | Type | Mode | Native Ctx | Extended Ctx |
| --- | --- | --- | --- | --- |
| Qwen3-4B-Instruct-2507 | Dense | Non-thinking | 256K | – |
| Qwen3-4B-Thinking-2507 | Dense | Thinking | 256K | – |
| Qwen3-30B-A3B-Instruct-2507 | MoE | Non-thinking | 256K | ~1M |
| Qwen3-30B-A3B-Thinking-2507 | MoE | Thinking | 256K | ~1M |
| Qwen3-235B-A22B-Instruct-2507 | MoE | Non-thinking | 256K | ~1M |
| Qwen3-235B-A22B-Thinking-2507 | MoE | Thinking | 256K | ~1M |

Key Improvements in 2507

Qwen3-Max & Qwen3-Max-Thinking

Qwen3-Max is Alibaba's closed-source flagship: a 1+ trillion-parameter MoE model available exclusively through API. Launched in September 2025, it received a major upgrade in January 2026 with the addition of test-time scaling (TTS) under the name Qwen3-Max-Thinking.

The TTS mechanism isn't naive best-of-N sampling; it uses an experience-cumulative, multi-round reasoning strategy that progressively refines answers across multiple internal passes. This pushes several benchmarks to state-of-the-art levels.

What Qwen3-Max-Thinking Leads On

#1 on HLE with search: 58.3 (beats GPT-5.2 at 45.5 and Gemini 3 Pro at 45.0)
#1 on IMO-AnswerBench: 91.5 (beats GPT-5.2 at 86.3)
#1 on LiveCodeBench v6: 91.4 (beats Gemini 3 Pro at 90.7)
#1 on Arena-Hard v2: 90.2 (general chat quality)
Perfect 100% on AIME 2025 and HMMT (with TTS)
GPQA Diamond: 92.8 (vs. GPT-5.2 at 92.4)

Qwen3-Max Specifications

Parameters: 1+ trillion (MoE)
Context Window: 256K tokens (up to 1M referenced)
License: Proprietary (API-only)
Thinking Mode: Toggle via enable_thinking parameter
Speed: ~38 tokens/second
Languages: 100+
API Compatibility: OpenAI-compatible + Anthropic-compatible

Benchmark Comparison: Qwen3-Max-Thinking vs Frontier Models

| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
| --- | --- | --- | --- | --- | --- |
| GPQA Diamond | 92.8 | 92.4 | 91.9 | 87.0 | 82.4 |
| HLE (with search) | 58.3 | 45.5 | 45.0 | 43.2 | 40.8 |
| IMO-AnswerBench | 91.5 | 86.3 | 83.3 | 84.0 | 78.3 |
| LiveCodeBench v6 | 91.4 | 87.7 | 90.7 | 84.8 | 80.8 |
| SWE-Bench Verified | 75.3 | 80.0 | 76.2 | 80.9 | 73.1 |
| Arena-Hard v2 | 90.2 | – | – | 76.7 | – |
| MMLU-Pro | 85.7 | – | – | – | – |

Qwen3-Max-Thinking scores with test-time scaling enabled. January 2026 snapshot.

Architecture Deep-Dive

All Qwen 3 text models share a Transformer backbone with several key innovations:

The MoE variants use a pool of 128 experts (8 active per token on the 235B model) with a global-batch load-balancing loss to prevent expert collapse. This sparse routing means only ~22B parameters fire on each forward pass, cutting FLOPs by roughly 6× compared to a dense 235B model.
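To make the sparse-routing idea concrete, here is a minimal, illustrative top-k router in PyTorch using the numbers quoted above (128 experts, 8 active per token). It is a sketch of the general technique, not Qwen's actual routing code, and it omits the load-balancing loss:

import torch
import torch.nn.functional as F

num_experts, top_k, hidden = 128, 8, 4096           # figures quoted in the text
router = torch.nn.Linear(hidden, num_experts, bias=False)

def route(x):                                        # x: [num_tokens, hidden]
    logits = router(x)                               # one score per expert
    probs = F.softmax(logits, dim=-1)
    weights, experts = probs.topk(top_k, dim=-1)     # keep only the 8 best experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize their mixture weights
    return weights, experts                          # only these experts run for this token

tokens = torch.randn(4, hidden)
w, idx = route(tokens)
print(idx.shape)                                     # torch.Size([4, 8]) -> 8 of 128 experts fire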

The later Qwen3-Next (September 2025) introduced a hybrid attention mechanism combining linear attention with standard self-attention for even greater efficiency, signaling the direction of future Qwen 3 releases.

The Hybrid Reasoning Engine

One of Qwen 3's signature innovations is its dual-mode reasoning system: the ability to switch between deep chain-of-thought reasoning and instant direct responses.

In the original April 2025 release, both modes lived in a single hybrid model controlled by the enable_thinking parameter. The 2507 update split this into dedicated Instruct (non-thinking) and Thinking variants, each optimized independently for its mode, resulting in significantly better performance on both sides.
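For the original hybrid checkpoints, the toggle is exposed through the chat template. A minimal sketch with Transformers (the prompt content is just an example):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "How many prime numbers are there below 50?"}]

# Thinking mode: the model reasons inside a <think>...</think> block before answering
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: fast, direct answer with no reasoning trace
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)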

Training & Alignment

The Qwen 3 training pipeline has three main pre-training stages followed by multi-step post-training:

Pre-training

Post-training

Benchmarks & Performance

Here's how the key Qwen 3 open-source models perform across major benchmarks, compared to competitors:

Qwen3-235B-A22B-Thinking-2507 (Open-Source Flagship)

| Benchmark | Qwen3-235B-T-2507 | Category |
| --- | --- | --- |
| AIME25 | 92.3 | Math Competition |
| HMMT25 | 83.9 | Math Competition |
| LiveCodeBench v6 | 74.1 | Coding |
| Arena-Hard v2 | 79.7 | General Chat |
| MMLU-Pro | 84.4 | Knowledge |
| GPQA Diamond | 81.1 | PhD-level Science |
| IFEval | 87.8 | Instruction Following |
| BFCL-v3 | 71.9 | Function Calling |

The Thinking-2507 variant beats o4-mini on LiveCodeBench and rivals it on AIME25 (92.3 vs 92.7).

April 2025 Launch Benchmarks (Original Models)

| Benchmark | Qwen3-235B (thinking) | Qwen3-30B-A3B | Category |
| --- | --- | --- | --- |
| Arena-Hard | 95.6 | 91.0 | General Chat |
| AIME'24 | 85.7 | – | Math |
| LiveCodeBench | 70.7 | – | Coding |
| BFCL | 70.8 | – | Tool Use |

Run Qwen 3 Locally

Every open-weight Qwen 3 model is available on Hugging Face and ModelScope in multiple formats. Here are the main deployment options:

Ollama (Easiest)

ollama run qwen3:8b

That's it β€” Ollama automatically downloads the quantized model and starts an interactive chat. Other popular tags:

vLLM (Production Serving)

vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --tensor-parallel-size 4

vLLM provides OpenAI-compatible API endpoints with PagedAttention for maximum throughput. Recommended for multi-user deployments.
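Because the server speaks the OpenAI protocol, any OpenAI SDK can talk to it. A minimal sketch against a locally started server (default port 8000; the api_key value is a placeholder since no auth is configured by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local vLLM server
response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",      # must match the model passed to `vllm serve`
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)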

llama.cpp (CPU + GPU Hybrid)

./llama-server -m qwen3-8b-q4_K_M.gguf -c 32768 -ngl 35

GGUF quantized models run with partial GPU offloading, making Qwen 3 accessible even on machines with limited VRAM.

Transformers (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 8B checkpoint and its tokenizer (downloaded from Hugging Face on first run)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Format the conversation with the Qwen chat template, then tokenize
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate a response and decode it back to text
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))

API Access & Pricing

Alibaba Cloud offers Qwen 3 models through DashScope / Model Studio with OpenAI-compatible endpoints. The flagship Qwen3-Max-Thinking pricing:

| Tier | Input | Output | Cache Read |
| --- | --- | --- | --- |
| Standard (≤128K ctx) | $1.20 / 1M tokens | $6.00 / 1M tokens | $0.24 / 1M |
| Long context (>128K) | $3.00 / 1M tokens | $15.00 / 1M tokens | $0.60 / 1M |

The open-source models can also be served through third-party providers like OpenRouter, Novita AI, and Fireworks AI β€” often at lower per-token prices than self-hosting.

Base URL (International): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
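Since the endpoint is OpenAI-compatible, the standard OpenAI SDK works against it. A minimal sketch (the model identifier below is an assumption; check Model Studio for the exact name available to your account):

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="YOUR_DASHSCOPE_API_KEY",               # issued in the Alibaba Cloud console
)

response = client.chat.completions.create(
    model="qwen3-max",                              # assumed model id; verify in Model Studio
    messages=[{"role": "user", "content": "Summarize the Qwen 3 family in two sentences."}],
)
print(response.choices[0].message.content)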

Hardware Requirements

| Model Size | Min VRAM (Q4) | Recommended GPU | Notes |
| --- | --- | --- | --- |
| 0.6B – 1.7B | 2–4 GB | Any GPU or CPU-only | Edge devices, Raspberry Pi, Jetson Nano |
| 4B | 4–6 GB | RTX 3060 / M1 Mac | Good for local prototyping |
| 8B | 6–10 GB | RTX 3070/4060 | Sweet spot: quality vs. speed |
| 14B | 10–16 GB | RTX 4070 Ti / A4000 | Strong all-rounder |
| 32B | 20–24 GB | RTX 4090 / A6000 | Near-frontier on consumer hardware |
| 30B-A3B (MoE) | 20–24 GB | RTX 4090 | Only 3B active; fast inference |
| 235B-A22B (MoE) | 80+ GB | 2–4× A100/H100 | vLLM + tensor parallelism recommended |

Tip: GGUF quantization (Q4_K_M) reduces VRAM usage by 70–80% with minimal quality loss. For the 235B MoE, AWQ and GPTQ quantized builds on Hugging Face make single-node deployment feasible on 2× A100 80GB.
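As a rough sanity check on those numbers, weight memory scales with parameter count times bits per weight. A back-of-the-envelope sketch (the overhead factor is an assumption and ignores the KV cache, which grows with context length):

def estimate_weight_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Very rough weight-memory estimate; excludes KV cache and activations."""
    return params_billion * bits_per_weight / 8 * overhead

print(estimate_weight_vram_gb(8.2, 16))    # FP16 Qwen3-8B -> ~19.7 GB
print(estimate_weight_vram_gb(8.2, 4.5))   # Q4_K_M (~4.5 bits/weight) -> ~5.5 GB, roughly a 70-80% cut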

Fine-Tuning Qwen 3

Thanks to the Apache 2.0 license, you can fine-tune any open-weight Qwen 3 model on your own data. Common approaches:

Recommended starting point: QLoRA on Qwen3-8B or Qwen3-14B with Unsloth for a 2–4× speedup and 60% less memory.
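A minimal QLoRA setup sketch using Hugging Face PEFT and bitsandbytes (a generic recipe rather than the Unsloth-specific path; the rank, alpha, and target-module choices are illustrative):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit NF4 (the "Q" in QLoRA), computing in bfloat16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", quantization_config=bnb, device_map="auto"
)

# Attach small LoRA adapters to the attention projections; only these are trained
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of weights are trainable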

Use Cases

Enterprise RAG: Combine 256K+ context with vector search for instant SOP and knowledge-base answers without chunking.
Autonomous Agents: Native tool calling (MCP) makes multi-step orchestration straightforward. The model can call APIs, query databases, and chain actions.
Agentic Coding: Qwen3-Coder-Next (80B total / 3B active) autonomously debugs, tests, and fixes real-world software.
Multilingual CX: 119-language fluency for global customer support, localisation, and content translation.
Long-Form Analysis: Ingest 500-page PDFs, legal contracts, or full codebases without splitting into chunks.
Speech & Voice: Qwen3-ASR transcribes 52 languages; Qwen3-TTS clones voices from 3 seconds of audio.
Edge / IoT: 0.6B and 1.7B dense models run on Jetson Orin, Raspberry Pi, and mobile devices for low-latency control.
Research & STEM: Qwen3-Max-Thinking scores 100% on AIME 2025 and leads HLE with search, making it ideal for scientific reasoning tasks.

Limitations & Considerations

While Qwen 3 is impressive, it's important to be aware of its boundaries:

Frequently Asked Questions

Is Qwen 3 free to use commercially?

Yes. All open-weight models (0.6B through 235B) are released under the Apache 2.0 license, which permits unrestricted commercial use, modification, and distribution. Only Qwen3-Max is proprietary (API-only).

What's the best Qwen 3 model for my hardware?

For 8 GB VRAM: Qwen3-4B or Qwen3-8B (Q4 quantized). For 24 GB VRAM: Qwen3-32B or Qwen3-30B-A3B (MoE). For cloud/multi-GPU: Qwen3-235B-A22B-Thinking-2507.

Should I use the original models or the 2507 update?

If a 2507 variant exists for your target size (4B, 30B-A3B, or 235B-A22B), always prefer the 2507 version. It offers dramatically better performance, larger context windows, and optimized reasoning. For sizes without a 2507 update (0.6B, 1.7B, 8B, 14B, 32B), the original models remain the latest available.

How does Qwen 3 compare to Llama, GPT, and Gemini?

The open-source Qwen3-235B-Thinking-2507 competes directly with o4-mini and Gemini 2.5 Pro on math and coding benchmarks. The closed-source Qwen3-Max-Thinking trades blows with GPT-5.2 and Gemini 3 Pro, leading on several benchmarks (HLE, IMO, LiveCodeBench) while trailing on others (SWE-Bench).

Can I use Qwen 3 for tool calling and agents?

Yes. Qwen 3 supports native JSON-defined tool calling compatible with the Model Context Protocol (MCP). Define your tools as JSON schemas, like the one below, and the model will generate structured function calls automatically; no custom parsing is needed.

{
  "name": "lookupWeather",
  "description": "Get current weather for a city",
  "parameters": {
    "type": "object",
    "properties": { "city": { "type": "string" } },
    "required": ["city"]
  }
}
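With an OpenAI-compatible endpoint (see the API section above), passing that schema takes one extra argument. A sketch using the same assumed model id as before:

from openai import OpenAI

client = OpenAI(
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
    api_key="YOUR_DASHSCOPE_API_KEY",
)

tools = [{
    "type": "function",
    "function": {
        "name": "lookupWeather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="qwen3-max",                         # assumed model id; verify in Model Studio
    messages=[{"role": "user", "content": "What's the weather in Hangzhou right now?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)  # structured function call(s) generated by the model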

Conclusion: The Most Complete Open AI Ecosystem

Qwen 3 isn't just a language model; it's a complete AI platform. From edge-ready 0.6B models to the trillion-parameter Qwen3-Max-Thinking that leads frontier benchmarks, and from agentic coding with Qwen3-Coder-Next to voice cloning with Qwen3-TTS and 52-language transcription with Qwen3-ASR, Alibaba Cloud has built an ecosystem where every component is either open-source and free to fine-tune, or competitively priced via API.

Whether you're building an autonomous agent, deploying a multilingual support bot, running a local code assistant, or pushing the limits of mathematical reasoning, Qwen 3 has a model for your use case. Fork it on GitHub, download weights from Hugging Face, or start chatting right now at chat.qwen.ai. Explore the full ecosystem on the Qwen AI homepage.