Qwen 3: The Complete Guide
Qwen 3 is Alibaba Cloud's most ambitious open-source AI initiative to date. Launched in April 2025 and continuously expanded ever since, the Qwen 3 family now spans text LLMs, coding agents, vision-language models, speech recognition, voice synthesis, embeddings, and a trillion-parameter cloud flagship. Every open-weight variant ships under the permissive Apache 2.0 license, is trained on 36 trillion tokens across 119 languages, and competes head-to-head with GPT-5, Gemini 3, and Claude Opus. Explore the full lineup on the Qwen AI homepage.
From a 0.6B-parameter edge model that fits on a Raspberry Pi to the 1-trillion-parameter Qwen3-Max-Thinking that tops the HLE leaderboard, the Qwen 3 ecosystem covers every deployment scenario. Stand-out features include a Hybrid Reasoning Engine, sparsely-activated MoE architectures, context windows up to 1 million tokens, and native tool calling. Below you'll find the complete model catalog, benchmarks, local deployment guides, and API pricing.
The Qwen 3 Ecosystem
What started as eight text models in April 2025 has evolved into a full-stack AI platform. Here's every sub-family at a glance:
| Sub-family | Purpose | Params | License | Released |
|---|---|---|---|---|
| Qwen3 (Base LLMs) | General text: chat, reasoning, agents | 0.6B–235B | Apache 2.0 | Apr 2025 |
| Qwen3-2507 | Updated Instruct & Thinking splits | 4B, 30B, 235B | Apache 2.0 | Jul 2025 |
| Qwen3-Max / Max-Thinking | Closed-source flagship with test-time scaling | 1T+ (MoE) | Proprietary | Sep 2025 / Jan 2026 |
| Qwen3-Coder / Coder-Next | Agentic coding with tool calling | 30B–480B | Apache 2.0 | Jul 2025 / Feb 2026 |
| Qwen3-VL | Vision-Language understanding | 2B–32B | Apache 2.0 | 2025 |
| Qwen3-Omni | Multimodal: text + image + audio + video | – | Apache 2.0 | Sep 2025 |
| Qwen3-ASR | Speech recognition (52 languages) | 0.6B / 8B | Apache 2.0 | 2025 |
| Qwen3-TTS | Text-to-speech & voice cloning | 0.6B | Apache 2.0 | 2025 |
| Qwen3-Embedding | Text embeddings & reranking | 0.6B / 8B | Apache 2.0 | Jun 2025 |
| Qwen3-Next | Next-gen ultra-efficient architecture | – | Apache 2.0 | Sep 2025 |
This page focuses on the core text LLMs (base models, 2507 update, and Qwen3-Max). For specialized models, follow the links above to their dedicated guides.
Base LLM Line-up: Dense and MoE
The original April 2025 launch introduced six dense models (0.6B to 32B) and two Mixture-of-Experts variants (30B-A3B and 235B-A22B). All share the same tokenizer, instruction format, and hybrid thinking/non-thinking capability, meaning you can swap model sizes without rewriting your code.
Dense Models (0.6B–32B)
Ideal for chatbots, real-time RAG pipelines, and edge deployment. The 8B variant runs comfortably under 10 GB VRAM with GGUF quantization while matching much larger competitors on multilingual benchmarks.
Mixture-of-Experts (MoE) Variants
- Qwen3-30B-A3B – 30.5B total parameters, ~3.3B active per token. Runs in real time on a single RTX 4090 and delivers an Arena-Hard score of 91.0 at roughly 1/8th the GPU cost of comparable dense models.
- Qwen3-235B-A22B – 235B total, ~22B active (8 of 128 experts per token). The open-source flagship that rivals GPT-4-class reasoning while halving inference cost compared to a monolithic dense model of equivalent quality.
Specification Matrix (April 2025 Launch)
| Model | Type | Total Params | Active Params | Native Ctx | Extended Ctx |
|---|---|---|---|---|---|
| Qwen3-0.6B | Dense | 0.6B | – | 32K | – |
| Qwen3-1.7B | Dense | 1.7B | – | 32K | – |
| Qwen3-4B | Dense | 4B | – | 32K | 128K (YaRN) |
| Qwen3-8B | Dense | 8.2B | – | 32K | 128K (YaRN) |
| Qwen3-14B | Dense | 14B | – | 32K | 128K (YaRN) |
| Qwen3-32B | Dense | 32.8B | – | 32K | 128K (YaRN) |
| Qwen3-30B-A3B | MoE | 30.5B | ~3.3B | 32K | 128K (YaRN) |
| Qwen3-235B-A22B | MoE | 235B | ~22B | 32K | 128K (YaRN) |
April 2025 launch specifications. See the 2507 update below for revised context lengths and dedicated variants.
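The 128K figures in the Extended Ctx column rely on YaRN RoPE scaling, which is not enabled by default. Below is a minimal sketch of switching it on with Hugging Face Transformers; the rope_scaling field names are an assumption that may differ between transformers versions, so verify them against the model card before relying on this:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Load the stock config, then override RoPE scaling to stretch the 32K native context.
config = AutoConfig.from_pretrained("Qwen/Qwen3-8B")
config.rope_scaling = {
    "rope_type": "yarn",                        # assumed field names; check the model card
    "factor": 4.0,                              # 32K * 4 = ~131K positions
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B", config=config, torch_dtype="auto", device_map="auto"
)
```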
The Qwen3-2507 Update (July 2025)
Three months after the initial launch, the Qwen team released a major revision that changed the reasoning philosophy. Instead of a single hybrid model that switches between thinking and non-thinking modes, the 2507 update introduced dedicated variants:
- Instruct-2507 – Non-thinking only. Optimized for instruction following, chat, and tool use. No <think> blocks are generated.
- Thinking-2507 – Thinking only. Always reasons through problems. Optimized for math, STEM, and complex logic.
2507 Variant Matrix
| Model | Type | Mode | Native Ctx | Extended Ctx |
|---|---|---|---|---|
| Qwen3-4B-Instruct-2507 | Dense | Non-thinking | 256K | β |
| Qwen3-4B-Thinking-2507 | Dense | Thinking | 256K | β |
| Qwen3-30B-A3B-Instruct-2507 | MoE | Non-thinking | 256K | ~1M |
| Qwen3-30B-A3B-Thinking-2507 | MoE | Thinking | 256K | ~1M |
| Qwen3-235B-A22B-Instruct-2507 | MoE | Non-thinking | 256K | ~1M |
| Qwen3-235B-A22B-Thinking-2507 | MoE | Thinking | 256K | ~1M |
Key Improvements in 2507
- Context window jump: From 32K native / 128K extended to 256K native and up to ~1 million tokens using DCA + MInference sparse attention. The 235B-Instruct-2507 scores 82.5% accuracy on the RULER benchmark at 1M tokens.
- Massive non-thinking gains: The Instruct-2507 variants saw dramatic improvements; for example, AIME25 jumped from 24.7 to 70.3 and ZebraLogic from 37.7 to 95.0 on the 235B model.
- Better instruction following: Enhanced math, coding, tool use, and multilingual knowledge across all sizes.
- The 4B-Thinking-2507 achieves an AIME25 score of 81.3, rivaling the much larger Qwen2.5-72B-Instruct.
Qwen3-Max & Qwen3-Max-Thinking
Qwen3-Max is Alibaba's closed-source flagship: a 1+ trillion-parameter MoE model available exclusively through the API. Launched in September 2025, it received a major upgrade in January 2026 with the addition of test-time scaling (TTS) under the name Qwen3-Max-Thinking.
The TTS mechanism isn't naive best-of-N sampling; it uses an experience-cumulative, multi-round reasoning strategy that progressively refines answers across multiple internal passes. This pushes several benchmarks to state-of-the-art levels.
Qwen3-Max Specifications
| Spec | Value |
|---|---|
| Parameters | 1+ trillion (MoE) |
| Context Window | 256K tokens (up to 1M referenced) |
| License | Proprietary (API-only) |
| Thinking Mode | Toggle via the enable_thinking parameter |
| Speed | ~38 tokens/second |
| Languages | 100+ |
| API Compatibility | OpenAI-compatible + Anthropic-compatible |
Benchmark Comparison: Qwen3-Max-Thinking vs Frontier Models
| Benchmark | Qwen3-Max (TTS) | GPT-5.2 | Gemini 3 Pro | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|---|
| GPQA Diamond | 92.8 | 92.4 | 91.9 | 87.0 | 82.4 |
| HLE (with search) | 58.3 | 45.5 | 45.0 | 43.2 | 40.8 |
| IMO-AnswerBench | 91.5 | 86.3 | 83.3 | 84.0 | 78.3 |
| LiveCodeBench v6 | 91.4 | 87.7 | 90.7 | 84.8 | 80.8 |
| SWE-Bench Verified | 75.3 | 80.0 | 76.2 | 80.9 | 73.1 |
| Arena-Hard v2 | 90.2 | – | – | 76.7 | – |
| MMLU-Pro | 85.7 | – | – | – | – |
Qwen3-Max-Thinking scores with test-time scaling enabled. January 2026 snapshot.
Architecture Deep-Dive
All Qwen 3 text models share a Transformer backbone with several key innovations:
- Grouped Query Attention (GQA) – Reduces KV-cache memory for faster inference and longer contexts.
- QK-Norm – Stabilizes attention logits during training, especially at large scale.
- SwiGLU Activations – Replaces standard FFN layers for better parameter efficiency.
- RoPE Scaling – Rotary position embeddings extended with YaRN for context beyond the training length.
The MoE variants use a pool of 128 experts (8 active per token on the 235B model) with a global-batch load-balancing loss to prevent expert collapse. This sparse routing means only ~22B parameters fire on each forward pass, cutting FLOPs by roughly 6× compared to a dense 235B model.
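To make the sparse routing concrete, here is an illustrative top-k router in PyTorch. It is a toy sketch of the general technique (softmax gate, keep 8 of 128 experts per token, weight and sum their outputs), not Qwen's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    """Toy top-k MoE routing sketch -- not Qwen's real code."""

    def __init__(self, d_model=64, d_ff=256, num_experts=128, top_k=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)   # router projection
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                                    # x: [tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)              # routing distribution over 128 experts
        weights, idx = probs.topk(self.top_k, dim=-1)        # keep only 8 experts per token
        weights = weights / weights.sum(-1, keepdim=True)    # renormalize the kept weights
        out = torch.zeros_like(x)
        for e in idx.unique():                               # run each selected expert once
            e = int(e)
            rows, slots = (idx == e).nonzero(as_tuple=True)
            out[rows] += weights[rows, slots, None] * self.experts[e](x[rows])
        return out

layer = ToyMoELayer()
print(layer(torch.randn(4, 64)).shape)                       # torch.Size([4, 64])
```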
The later Qwen3-Next (September 2025) introduced a hybrid attention mechanism combining linear attention with standard self-attention for even greater efficiency, signaling the direction of future Qwen 3 releases.
The Hybrid Reasoning Engine
One of Qwen 3's signature innovations is its dual-mode reasoning system – the ability to switch between deep chain-of-thought reasoning and instant direct responses:
- Thinking Mode – The model emits explicit <think> chains before answering, walking through problems step-by-step. Ideal for math, STEM, multi-hop logic, and coding.
- Non-Thinking (Fast) Mode – Bypasses chain-of-thought entirely for low-latency responses. Best for support chat, search, simple QA, and high-throughput scenarios.
- Thinking Budget – You can hard-cap reasoning tokens with max_thought_tokens to control the latency/accuracy trade-off.
In the original April 2025 release, both modes lived in a single hybrid model controlled by the enable_thinking parameter. The 2507 update split this into dedicated Instruct (non-thinking) and Thinking variants, each optimized independently for its mode, resulting in significantly better performance on both sides.
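In code, the toggle surfaces as an enable_thinking argument on the chat template of the original hybrid checkpoints. A minimal sketch follows; the argument name matches Qwen's published Transformers examples, but verify it against the model card for your specific checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Is 1011 prime?"}]

# Thinking mode: the prompt leaves room for an explicit <think> ... </think> trace.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Fast mode: chain-of-thought is suppressed for low-latency answers.
prompt_fast = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```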
Training & Alignment
The Qwen 3 training pipeline has three main pre-training stages followed by multi-step post-training:
Pre-training
- Stage 1 (30T tokens): Mixed-quality web and code data at 4K sequence length – builds broad world knowledge.
- Stage 2 (5T tokens): High-quality STEM, coding, and reasoning data – sharpens analytic ability.
- Stage 3 (hundreds of billions of tokens): 32K-sequence long-context bootstrapping – teaches the model to handle extended documents.
Post-training
- Multi-round supervised fine-tuning (SFT) with diverse instruction data
- Reward-model RLHF for alignment with human preferences
- Reasoning-focused reinforcement learning to improve chain-of-thought quality
- On-policy distillation that reduces compute by ~90% compared to pure RL
Benchmarks & Performance
Here's how the key Qwen 3 open-source models perform across major benchmarks, compared to competitors:
Qwen3-235B-A22B-Thinking-2507 (Open-Source Flagship)
| Benchmark | Qwen3-235B-T-2507 | Category |
|---|---|---|
| AIME25 | 92.3 | Math Competition |
| HMMT25 | 83.9 | Math Competition |
| LiveCodeBench v6 | 74.1 | Coding |
| Arena-Hard v2 | 79.7 | General Chat |
| MMLU-Pro | 84.4 | Knowledge |
| GPQA Diamond | 81.1 | PhD-level Science |
| IFEval | 87.8 | Instruction Following |
| BFCL-v3 | 71.9 | Function Calling |
The Thinking-2507 variant beats o4-mini on LiveCodeBench and rivals it on AIME25 (92.3 vs. 92.7).
April 2025 Launch Benchmarks (Original Models)
| Benchmark | Qwen3-235B (thinking) | Qwen3-30B-A3B | Category |
|---|---|---|---|
| Arena-Hard | 95.6 | 91.0 | General Chat |
| AIME'24 | 85.7 | – | Math |
| LiveCodeBench | 70.7 | – | Coding |
| BFCL | 70.8 | – | Tool Use |
Run Qwen 3 Locally
Every open-weight Qwen 3 model is available on Hugging Face and ModelScope in multiple formats. Here are the main deployment options:
Ollama (Easiest)
ollama run qwen3:8b
That's it: Ollama automatically downloads the quantized model and starts an interactive chat. Other popular tags:
- ollama run qwen3:4b – Lightweight, runs on 8 GB VRAM
- ollama run qwen3:14b – Sweet spot for most users
- ollama run qwen3:32b – Near-frontier on a single GPU
- ollama run qwen3:30b-a3b – MoE: 30B params, only 3B active
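Beyond the interactive chat, Ollama exposes a local REST API (default port 11434), so the same models can be driven from code. A minimal sketch using the /api/chat endpoint:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",     # Ollama's default local endpoint
    json={
        "model": "qwen3:8b",
        "messages": [{"role": "user", "content": "Summarize RoPE scaling in two sentences."}],
        "stream": False,                   # return one JSON object instead of a token stream
    },
)
print(resp.json()["message"]["content"])
```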
vLLM (Production Serving)
vllm serve Qwen/Qwen3-235B-A22B-Instruct-2507 --tensor-parallel-size 4
vLLM provides OpenAI-compatible API endpoints with PagedAttention for maximum throughput. Recommended for multi-user deployments.
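Once the server is up, any OpenAI client can talk to it. A minimal sketch, assuming the default port 8000 and the model name from the serve command above:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; no real key is required for a local deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B-Instruct-2507",
    messages=[{"role": "user", "content": "Draft a polite follow-up email in French."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```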
llama.cpp (CPU + GPU Hybrid)
./llama-server -m qwen3-8b-q4_K_M.gguf -c 32768 -ngl 35
GGUF quantized models run with partial GPU offloading, making Qwen 3 accessible even on machines with limited VRAM.
Transformers (Python)
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
messages = [{"role": "user", "content": "Explain quantum entanglement simply."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
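Because the hybrid Qwen3-8B thinks by default, the generated tokens begin with a <think> trace. A hypothetical way to separate the trace from the final answer, reusing the variables from the snippet above (the </think> marker id is looked up from the tokenizer rather than hard-coded):

```python
# Split the newly generated ids at the last </think> marker, then decode each half.
gen_ids = output[0][inputs.input_ids.shape[1]:].tolist()
end_think_id = tokenizer.convert_tokens_to_ids("</think>")
cut = len(gen_ids) - gen_ids[::-1].index(end_think_id) if end_think_id in gen_ids else 0
thinking = tokenizer.decode(gen_ids[:cut], skip_special_tokens=True).strip()
answer = tokenizer.decode(gen_ids[cut:], skip_special_tokens=True).strip()
print(answer)
```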
API Access & Pricing
Alibaba Cloud offers Qwen 3 models through DashScope / Model Studio with OpenAI-compatible endpoints. The flagship Qwen3-Max-Thinking pricing:
| Tier | Input | Output | Cache Read |
|---|---|---|---|
| Standard (β€128K ctx) | $1.20 / 1M tokens | $6.00 / 1M tokens | $0.24 / 1M |
| Long context (>128K) | $3.00 / 1M tokens | $15.00 / 1M tokens | $0.60 / 1M |
The open-source models can also be served through third-party providers like OpenRouter, Novita AI, and Fireworks AI, often at lower per-token prices than self-hosting.
Base URL (International): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
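A minimal sketch of calling the flagship through the OpenAI-compatible endpoint above. The model identifier and the enable_thinking extension field are assumptions; check the Model Studio catalog and API reference for the exact names available to your account:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DASHSCOPE_API_KEY"],    # your Alibaba Cloud Model Studio key
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

resp = client.chat.completions.create(
    model="qwen3-max",                          # assumed model id; see the Model Studio catalog
    messages=[{"role": "user", "content": "Outline a proof that sqrt(2) is irrational."}],
    extra_body={"enable_thinking": True},       # vendor extension for the thinking toggle (assumption)
)
print(resp.choices[0].message.content)
```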
Hardware Requirements
| Model Size | Min VRAM (Q4) | Recommended GPU | Notes |
|---|---|---|---|
| 0.6B–1.7B | 2–4 GB | Any GPU or CPU-only | Edge devices, Raspberry Pi, Jetson Nano |
| 4B | 4–6 GB | RTX 3060 / M1 Mac | Good for local prototyping |
| 8B | 6–10 GB | RTX 3070/4060 | Sweet spot: quality vs. speed |
| 14B | 10–16 GB | RTX 4070 Ti / A4000 | Strong all-rounder |
| 32B | 20–24 GB | RTX 4090 / A6000 | Near-frontier on consumer hardware |
| 30B-A3B (MoE) | 20–24 GB | RTX 4090 | Only 3B active – fast inference |
| 235B-A22B (MoE) | 80+ GB | 2–4× A100/H100 | vLLM + tensor parallelism recommended |
Tip: GGUF quantization (Q4_K_M) reduces VRAM usage by 70–80% with minimal quality loss. For the 235B MoE, AWQ and GPTQ quantized builds on Hugging Face make single-node deployment feasible on 2× A100 80GB.
Fine-Tuning Qwen 3
Thanks to the Apache 2.0 license, you can fine-tune any open-weight Qwen 3 model on your own data. Common approaches:
- LoRA / QLoRA – Low-rank adaptation with 4-bit quantization. Fine-tune the 8B model on a single RTX 4090 with 24 GB VRAM. Tools: Unsloth, LLaMA-Factory, or Hugging Face PEFT.
- Full fine-tuning – For maximum control. Requires multi-GPU setups for models above 8B. DeepSpeed ZeRO-3 or FSDP recommended.
- DPO / RLHF – Preference-based alignment to customize model behavior. Supported natively by the TRL (Transformer Reinforcement Learning) library.
Recommended starting point: QLoRA on Qwen3-8B or Qwen3-14B with Unsloth for a 2–4× speedup and 60% less memory. A minimal PEFT-based sketch follows below.
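Here is a compact QLoRA sketch with Hugging Face PEFT and TRL. The dataset path, hyperparameters, and LoRA target modules are illustrative placeholders, and argument names shift between trl/peft releases, so treat this as a template rather than a recipe:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen3-8B"

# 4-bit NF4 quantization so the 8B model fits on a single 24 GB GPU.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

# Low-rank adapters on the attention projections; ranks and targets are illustrative.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # your own instruction data

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora,
    args=SFTConfig(output_dir="qwen3-8b-qlora", per_device_train_batch_size=1,
                   gradient_accumulation_steps=8, num_train_epochs=1, logging_steps=10),
)
trainer.train()
```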
Limitations & Considerations
While Qwen 3 is impressive, it's important to be aware of its boundaries:
- Hallucination: Like all LLMs, Qwen 3 can generate plausible-sounding but incorrect information. Always verify factual claims in critical applications.
- SWE-Bench gap: On real-world software engineering tasks (SWE-Bench Verified), Qwen3-Max-Thinking scores 75.3, behind Claude Opus 4.5 (80.9) and GPT-5.2 (80.0).
- Speed vs. intelligence trade-off: Qwen3-Max-Thinking runs at ~38 tokens/second, significantly slower than lighter models. For latency-sensitive applications, consider the 30B-A3B or dense variants.
- Closed-source flagship: Qwen3-Max is API-only and proprietary. The most powerful Qwen model cannot be self-hosted or fine-tuned.
- MoE memory overhead: While MoE models activate fewer parameters, they still require loading all expert weights into memory. The 235B MoE needs ~80+ GB VRAM even quantized.
- Safety alignment: As with any open-weight model, fine-tuned variants may lose safety guardrails. Deploy responsibly with appropriate safeguards.
Frequently Asked Questions
Is Qwen 3 free to use commercially?
Yes. All open-weight models (0.6B through 235B) are released under the Apache 2.0 license, which permits unrestricted commercial use, modification, and distribution. Only Qwen3-Max is proprietary (API-only).
What's the best Qwen 3 model for my hardware?
For 8 GB VRAM: Qwen3-4B or Qwen3-8B (Q4 quantized). For 24 GB VRAM: Qwen3-32B or Qwen3-30B-A3B (MoE). For cloud/multi-GPU: Qwen3-235B-A22B-Thinking-2507.
Should I use the original models or the 2507 update?
If a 2507 variant exists for your target size (4B, 30B-A3B, or 235B-A22B), always prefer the 2507 version. It offers dramatically better performance, larger context windows, and optimized reasoning. For sizes without a 2507 update (0.6B, 1.7B, 8B, 14B, 32B), the original models remain the latest available.
How does Qwen 3 compare to Llama, GPT, and Gemini?
The open-source Qwen3-235B-Thinking-2507 competes directly with o4-mini and Gemini 2.5 Pro on math and coding benchmarks. The closed-source Qwen3-Max-Thinking trades blows with GPT-5.2 and Gemini 3 Pro, leading on several benchmarks (HLE, IMO, LiveCodeBench) while trailing on others (SWE-Bench).
Can I use Qwen 3 for tool calling and agents?
Yes. Qwen 3 supports native JSON-defined tool calling compatible with the Model Context Protocol (MCP). Define your tools as JSON schemas, like the example below, and the model will generate structured function calls automatically. No custom parsing needed.
{
"name": "lookupWeather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": { "city": { "type": "string" } },
"required": ["city"]
}
}
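A hypothetical round-trip showing how that schema is passed through an OpenAI-compatible endpoint (here a local vLLM server assumed to be started with tool-call parsing enabled; the model name is a placeholder for whatever you serve):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Wrap the schema from above in the OpenAI-style tools array.
tools = [{
    "type": "function",
    "function": {
        "name": "lookupWeather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",               # placeholder model id
    messages=[{"role": "user", "content": "What's the weather in Lisbon right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)                    # structured lookupWeather call, if emitted
```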
Conclusion: The Most Complete Open AI Ecosystem
Qwen 3 isn't just a language model; it's a complete AI platform. From edge-ready 0.6B models to the trillion-parameter Qwen3-Max-Thinking that leads frontier benchmarks, from agentic coding with Qwen3-Coder-Next to voice cloning with Qwen3-TTS and 52-language transcription with Qwen3-ASR, Alibaba Cloud has built an ecosystem where every component is either open-source and free to fine-tune, or competitively priced via API.
Whether you're building an autonomous agent, deploying a multilingual support bot, running a local code assistant, or pushing the limits of mathematical reasoning, Qwen 3 has a model for your use case. Fork it on GitHub, download weights from Hugging Face, or start chatting right now at chat.qwen.ai. Explore the full ecosystem on the Qwen AI homepage.