Qwen3-ASR: Alibaba's Open-Source Speech Recognition System
Qwen3-ASR is Alibaba Cloud's open-source automatic speech recognition system, released on January 30, 2026. It supports 52 languages and dialects (including 22 Chinese dialects), handles both streaming and offline transcription from a single unified model, and achieves state-of-the-art accuracy across Chinese, multilingual, and even singing voice benchmarks. The entire family is released under the Apache 2.0 license, making it one of the most capable open-source ASR systems available today.
What sets Qwen3-ASR apart from alternatives like Whisper or GPT-4o Transcribe isn't just raw accuracy — it's the combination of capabilities packed into a single lightweight model. It can transcribe noisy environments, recognize singing voices over background music, provide word-level timestamps via its companion ForcedAligner model, and switch seamlessly between streaming and offline modes. For context on the broader model family, see the Qwen 3 overview.
Model Variants
The Qwen3-ASR family includes three open-source models and one cloud-only API variant:
| Model | Parameters | Role | License |
|---|---|---|---|
| Qwen3-ASR-1.7B | ~2B (1.7B LLM + encoder + projector) | Best quality, offline & streaming | Apache 2.0 |
| Qwen3-ASR-0.6B | ~0.9B (0.6B LLM + encoder + projector) | Faster, lower VRAM | Apache 2.0 |
| Qwen3-ForcedAligner-0.6B | ~0.9B | Word-level timestamp alignment | Apache 2.0 |
| Qwen3-ASR-Flash (API only) | Not disclosed | Cloud real-time API via DashScope | Proprietary |
The 1.7B variant is the flagship for accuracy, while the 0.6B model offers a compelling trade-off: it achieves a real-time factor (RTF) of 0.064 at a concurrency of 128, which works out to roughly 2,000 seconds of audio transcribed per second of wall-clock time. Both models share the same architecture and support the same 52 languages.
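As a quick sanity check on that throughput figure, here is the arithmetic in a minimal sketch, assuming the RTF of 0.064 is measured per stream at a concurrency of 128 (as in the hardware table later in this guide):

```python
# Rough throughput estimate from a per-stream real-time factor (RTF).
# Assumption: RTF 0.064 is measured at a concurrency of 128 parallel streams.
rtf = 0.064          # seconds of compute per second of audio, per stream
concurrency = 128    # parallel streams

audio_seconds_per_wall_second = concurrency / rtf
print(f"~{audio_seconds_per_wall_second:.0f} seconds of audio transcribed per second")
# -> ~2000 seconds of audio transcribed per second
```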
Architecture: Audio Transformer + Qwen3 LLM
Qwen3-ASR uses a three-component architecture built on top of the Qwen3-Omni foundation:
- AuT Encoder — A pretrained Audio Transformer that performs 8x downsampling on 128-dimensional Fbank features, producing a 12.5 Hz token rate. The 1.7B variant uses a 300M-parameter encoder with a hidden size of 1024; the 0.6B variant uses a 180M-parameter encoder with a hidden size of 896.
- Projector — Bridges the audio encoder's representations to the LLM decoder's embedding space.
- LLM Decoder — A Qwen3-1.7B or Qwen3-0.6B language model that generates the text output autoregressively.
A key innovation is the dynamic flash attention window that ranges from 1 to 8 seconds. This allows a single model to handle both streaming (low-latency, real-time) and offline (full-context, higher accuracy) inference without needing separate model weights.
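A quick back-of-the-envelope check of the 12.5 Hz token rate, assuming the conventional 10 ms Fbank hop (100 frames per second), which the section above does not restate:

```python
# Token rate after the AuT encoder's 8x downsampling.
# Assumption: Fbank features use the conventional 10 ms hop (100 frames/s).
fbank_frame_rate_hz = 100   # 10 ms hop
downsampling_factor = 8

token_rate_hz = fbank_frame_rate_hz / downsampling_factor
tokens_per_minute = token_rate_hz * 60
print(f"{token_rate_hz} Hz -> {tokens_per_minute:.0f} audio tokens per minute of speech")
# -> 12.5 Hz -> 750 audio tokens per minute of speech
```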
Training Pipeline
The model was trained through a rigorous four-stage pipeline:
- AuT Pretraining — Approximately 40 million hours of pseudo-labeled ASR data, primarily Chinese and English.
- Omni Pretraining — Multi-task training with 3 trillion tokens covering audio, vision, and text (as part of the Qwen3-Omni foundation).
- ASR Supervised Fine-Tuning — Targeted training on the ASR format with multilingual data, streaming enhancements, and context-biasing data.
- Reinforcement Learning (GSPO) — Group Sequence Policy Optimization with ~50,000 utterances to refine output quality.
Supported Languages: 52 Total
Qwen3-ASR supports 30 languages and 22 Chinese dialects, giving it unusually broad dialect coverage for an open-source ASR model.
The model also handles multilingual code-switching within a single utterance — for example, a speaker mixing English and Chinese in the same sentence without any manual language selection.
Benchmarks & Performance
Qwen3-ASR-1.7B achieves state-of-the-art results across multiple categories, significantly outperforming Whisper-large-v3 and GPT-4o-Transcribe in Chinese and multilingual benchmarks:
English ASR (Word Error Rate — lower is better)
| Dataset | Qwen3-ASR-1.7B | Whisper-large-v3 | GPT-4o-Transcribe |
|---|---|---|---|
| LibriSpeech (clean/other) | 1.63 / 3.38 | 1.51 / 3.97 | 1.39 / 3.75 |
| GigaSpeech | 8.45 | 9.76 | — |
| CommonVoice-en | 7.39 | 9.90 | — |
Chinese ASR (Character Error Rate — lower is better)
| Dataset | Qwen3-ASR-1.7B | Whisper-large-v3 | GPT-4o-Transcribe |
|---|---|---|---|
| WenetSpeech (net/meeting) | 4.97 / 5.88 | 9.86 / 19.11 | 15.30 / 32.27 |
| AISHELL-2 test | 2.71 | — | 4.24 |
| SpeechIO | 2.88 | — | 12.86 |
Singing Voice & Background Music (WER)
| Dataset | Qwen3-ASR-1.7B | GPT-4o | Doubao-ASR |
|---|---|---|---|
| M4Singer | 5.98 | 16.77 | 7.88 |
| EntireSongs-en | 14.60 | 30.71 | 33.51 |
| EntireSongs-zh | 13.91 | 34.86 | 23.99 |
The singing voice results are particularly impressive — Qwen3-ASR achieves sub-6% WER on solo singing and sub-15% even on full songs with strong background music, dramatically outperforming all competitors.
Key Differentiating Features
1. Unified Streaming & Offline Model
Most ASR models force you to choose between a streaming model (low latency, lower accuracy) and an offline model (higher accuracy, higher latency). Qwen3-ASR uses a dynamic flash attention window (1–8 seconds) that lets a single set of weights handle both modes. In streaming mode, the 1.7B model maintains a WER of 4.51 on LibriSpeech-other vs. 3.38 in offline — a very small accuracy trade-off for real-time capability.
2. Context Biasing
This is a feature that most open-source ASR models lack entirely. You can provide arbitrary text (up to 10,000 tokens in the API) to bias the transcription toward specific terminology — company names, medical terms, legal jargon, product codes, or any domain-specific vocabulary. The model will prefer these terms when the audio is ambiguous, dramatically improving accuracy in specialized domains.
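Here is a minimal sketch of passing biasing text to a locally hosted Qwen3-ASR instance (see "How to Run Qwen3-ASR Locally" below). It assumes the OpenAI-compatible server honors the standard `prompt` field of the transcription endpoint as the biasing channel; check the package documentation for the exact parameter name.

```python
from openai import OpenAI

# Assumption: qwen-asr-serve is running locally and exposes the standard
# OpenAI-compatible /v1/audio/transcriptions route; the `prompt` field is
# used here as the context-biasing channel (hypothetical parameter mapping).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

biasing_terms = "Qwen3-Omni, DashScope, GSPO, ForcedAligner, AuT encoder"

with open("earnings_call.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR-1.7B",
        file=audio_file,
        prompt=biasing_terms,  # domain terms the model should prefer
    )

print(result.text)
```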
3. Singing Voice & Music Recognition
Qwen3-ASR can accurately transcribe singing voices even with strong background music — something that destroys the accuracy of virtually every other ASR model. It achieves under 6% WER on solo singing benchmarks and under 15% on full songs with instrumental accompaniment, outperforming GPT-4o by more than 2x.
4. Inverse Text Normalization
The model automatically converts spoken forms to clean written text: numbers, dates, currencies, email addresses, and URLs are formatted properly without any post-processing needed.
How to Run Qwen3-ASR Locally
The Qwen team provides an official Python package that makes deployment straightforward. There are two backends: a basic Transformers backend and a high-performance vLLM backend.
Option 1: Quick Install (Transformers Backend)
The simplest way to get started for testing and light usage.
- Install the package: `pip install -U qwen-asr`
- Launch the web demo: `qwen-asr-demo Qwen/Qwen3-ASR-1.7B`
- Or use the streaming demo: `qwen-asr-demo-streaming Qwen/Qwen3-ASR-1.7B`
Option 2: vLLM Backend (Recommended for Speed)
For production use or when you need high throughput.
- Install with vLLM support: `pip install -U qwen-asr[vllm]`
- Launch the OpenAI-compatible server: `qwen-asr-serve Qwen/Qwen3-ASR-1.7B --port 8000`
This exposes an OpenAI-compatible API endpoint, enabling drop-in replacement in existing workflows.
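For example, a minimal request against the local endpoint using the `requests` library; this is a sketch that assumes the server mirrors OpenAI's /v1/audio/transcriptions route, as described above:

```python
import requests

# Assumption: the local server mirrors OpenAI's /v1/audio/transcriptions route.
url = "http://localhost:8000/v1/audio/transcriptions"

with open("meeting.mp3", "rb") as f:
    response = requests.post(
        url,
        files={"file": ("meeting.mp3", f, "audio/mpeg")},
        data={"model": "Qwen/Qwen3-ASR-1.7B"},
    )

response.raise_for_status()
print(response.json()["text"])
```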
Option 3: Docker
For containerized deployments with all dependencies pre-installed.
`docker run --gpus all --shm-size=4gb -p 8000:8000 qwenllm/qwen3-asr:latest`
Supported Audio Inputs
Qwen3-ASR accepts a wide variety of input formats:
- Local files — MP3, WAV, M4A, MP4, MOV, MKV, and more (via FFmpeg)
- HTTP/HTTPS URLs (direct audio links)
- Base64-encoded audio data
- NumPy arrays with sample rate tuples
All inputs are automatically resampled to 16kHz mono.
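If you want to pass a NumPy array directly, you can replicate that preprocessing yourself. A sketch using librosa follows; the exact (waveform, sample rate) tuple convention expected by the package is an assumption here, so consult its docs:

```python
import librosa

# Load any FFmpeg-decodable file as 16 kHz mono float32, matching the
# preprocessing Qwen3-ASR applies to all inputs.
waveform, sample_rate = librosa.load("interview.m4a", sr=16000, mono=True)

# Assumption: the qwen-asr package accepts a (waveform, sample_rate) tuple
# as described above; check its documentation for the exact calling convention.
audio_input = (waveform, sample_rate)
print(waveform.shape, sample_rate)
```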
Hardware Requirements
| Model | Min VRAM | Recommended | RTF (Concurrency 128) |
|---|---|---|---|
| Qwen3-ASR-1.7B | ~4 GB (weights only) | 16 GB+ for batch inference | — |
| Qwen3-ASR-0.6B | ~2 GB (weights only) | 8 GB+ for batch inference | 0.064 |
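The "weights only" figures line up with a simple bf16 estimate (a rough calculation; actual usage grows with KV cache, activations, and batch size):

```python
# Rough VRAM needed just to hold the weights in bfloat16 (2 bytes per parameter).
def weights_vram_gb(total_params_billions: float, bytes_per_param: int = 2) -> float:
    return total_params_billions * 1e9 * bytes_per_param / 1e9

print(f"Qwen3-ASR-1.7B (~2B total params): ~{weights_vram_gb(2.0):.0f} GB")    # ~4 GB
print(f"Qwen3-ASR-0.6B (~0.9B total params): ~{weights_vram_gb(0.9):.1f} GB")  # ~1.8 GB
```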
For more details on running Qwen models locally, check our guide to running Qwen locally and our hardware requirements page.
Cloud API: Qwen3-ASR-Flash
For users who don't want to self-host, Alibaba offers Qwen3-ASR-Flash as a real-time WebSocket API through the DashScope platform. Key details:
| Detail | Value |
|---|---|
| Protocol | WebSocket (WSS) |
| Endpoint (International) | wss://dashscope-intl.aliyuncs.com/api-ws/v1/realtime |
| Pricing (International) | $0.00009/second (~$0.32/hour) |
| Rate Limit | 20 requests per second |
| Audio Formats | PCM, Opus (8 kHz or 16 kHz, mono) |
| Features | Real-time streaming, emotion recognition, context biasing |
The API-only variant also supports features not yet available in the open-source models, including emotion recognition in real-time.
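The hourly figure in the table above follows directly from the per-second price:

```python
# Pricing check for Qwen3-ASR-Flash (international endpoint).
price_per_second = 0.00009  # USD per second of audio
price_per_hour = price_per_second * 3600
print(f"${price_per_hour:.3f} per hour of audio")  # -> $0.324 per hour (~$0.32)
```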
Qwen3-ASR-Toolkit (CLI for Long Audio)
The official Qwen3-ASR-Toolkit provides a command-line tool that overcomes the API's 3-minute audio limit through intelligent chunking:
- Install: `pip install qwen3-asr-toolkit`
- Transcribe a long file: `qwen3-asr -i audio_file.mp3 -j 4 --save-srt`

It handles automatic resampling, Voice Activity Detection (VAD) for smart splitting, parallel processing across multiple threads, SRT subtitle generation, and hallucination removal post-processing.
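The chunking idea is straightforward to sketch: cut near silence so that no segment exceeds the API's 3-minute limit. The snippet below uses a naive energy threshold rather than the toolkit's actual VAD, purely to illustrate the approach; all thresholds are illustrative assumptions.

```python
import numpy as np

def split_on_silence(waveform: np.ndarray, sample_rate: int,
                     max_chunk_s: float = 170.0, frame_ms: float = 30.0,
                     energy_threshold: float = 1e-4):
    """Greedy split: end each chunk at a quiet frame before the length limit.

    A simplified stand-in for the toolkit's VAD-based splitting, not its
    actual algorithm. Thresholds here are illustrative assumptions.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    max_frames = int(max_chunk_s * 1000 / frame_ms)

    # Per-frame mean energy.
    n_frames = len(waveform) // frame_len
    energies = np.array([
        np.mean(waveform[i * frame_len:(i + 1) * frame_len] ** 2)
        for i in range(n_frames)
    ])

    chunks, start = [], 0
    while start < n_frames:
        end = min(start + max_frames, n_frames)
        if end < n_frames:
            # Prefer the quietest frame in the last 10% of the window.
            window = energies[start + int(0.9 * max_frames):end]
            if len(window) and window.min() < energy_threshold:
                end = start + int(0.9 * max_frames) + int(np.argmin(window)) + 1
        chunks.append(waveform[start * frame_len:end * frame_len])
        start = end
    return chunks
```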
Qwen3-ForcedAligner: Millisecond-Precision Timestamps
The companion Qwen3-ForcedAligner-0.6B model provides word and character-level timestamps with exceptional precision. It uses a non-autoregressive (NAR) architecture with a "slot-filling" approach to timestamp prediction.
Mean absolute alignment error (lower is better):
| Language | Qwen3-ForcedAligner | WhisperX | NFA |
|---|---|---|---|
| Chinese | 33.1 ms | — | 109.8 ms |
| English | 37.5 ms | 92.1 ms | 107.5 ms |
| French | 41.7 ms | 145.3 ms | 100.7 ms |
| German | 46.5 ms | 165.1 ms | 122.7 ms |
| Japanese | 42.2 ms | — | — |
| Average | 42.9 ms | 133.2 ms | 129.8 ms |
The ForcedAligner achieves a 67–77% reduction in alignment error compared to WhisperX and NFA, averaging just 42.9ms absolute shift. It supports 11 languages: Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. This makes it ideal for subtitle generation, karaoke timing, audio editing, and any application requiring precise word-level synchronization.
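For the subtitle use case, word-level timestamps map straightforwardly onto SRT cues. Below is a minimal sketch; the `words` list of (word, start, end) tuples in seconds is a hypothetical stand-in for whatever structure the ForcedAligner actually returns.

```python
def to_srt(words, max_gap_s: float = 0.8, max_words: int = 10) -> str:
    """Group (word, start_s, end_s) tuples into SRT cues.

    `words` is a hypothetical output format; adapt to the aligner's actual
    return structure.
    """
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    cues, current = [], []
    for word, start, end in words:
        if current and (start - current[-1][2] > max_gap_s or len(current) >= max_words):
            cues.append(current)
            current = []
        current.append((word, start, end))
    if current:
        cues.append(current)

    lines = []
    for i, cue in enumerate(cues, 1):
        text = " ".join(w for w, _, _ in cue)
        lines.append(f"{i}\n{ts(cue[0][1])} --> {ts(cue[-1][2])}\n{text}\n")
    return "\n".join(lines)

print(to_srt([("Qwen3", 0.00, 0.42), ("ASR", 0.42, 0.80), ("rocks", 1.90, 2.30)]))
```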
Qwen3-ASR vs. Whisper vs. GPT-4o Transcribe
How does Qwen3-ASR stack up against the most popular alternatives?
| Feature | Qwen3-ASR-1.7B | Whisper-large-v3 | GPT-4o-Transcribe |
|---|---|---|---|
| Open-source | Yes (Apache 2.0) | Yes | No (API only) |
| Languages | 52 (30 languages + 22 Chinese dialects) | 99 | 50+ |
| Chinese performance | SOTA (2–3x better) | Moderate | Weak |
| English performance | Competitive | Strong | Strong |
| Singing/BGM | SOTA (<6% WER solo) | Poor | Poor |
| Streaming | Yes (unified model) | No | Yes |
| Forced alignment | Yes (42.9ms avg) | Via WhisperX (133ms) | No |
| Context biasing | Yes (10K tokens) | No | No |
| Self-hostable | Yes | Yes | No |
Whisper still leads in raw language count (99 vs. 52), and GPT-4o edges ahead slightly on clean English benchmarks like LibriSpeech. But Qwen3-ASR dominates in Chinese, multilingual robustness, singing voice recognition, streaming capability, and the forced alignment use case. For most real-world applications — especially those involving noisy audio, multiple languages, or specialized terminology — Qwen3-ASR is the stronger choice. If you want to compare Qwen models against other LLMs more broadly, check our Qwen vs. ChatGPT comparison.
Final Verdict
Qwen3-ASR is a major step forward for open-source speech recognition. It doesn't just match commercial alternatives — it surpasses them in key areas like Chinese transcription, singing voice recognition, and word-level alignment precision. The unified streaming/offline design, context biasing capability, and the lightweight model sizes make it practical for everything from personal projects to production deployments.
The Apache 2.0 license removes any commercial restrictions, and the official Python package makes getting started as simple as pip install qwen-asr. Whether you're building a transcription service, adding speech input to an application, or creating subtitles, Qwen3-ASR deserves serious consideration.
For the companion text-to-speech models, see our Qwen3-TTS guide. Explore the full Qwen 3 family, try Qwen AI Chat, or check our guide to running Qwen models locally.