Qwen3-TTS: Alibaba's Open-Source Text-to-Speech System

Qwen3-TTS is Alibaba Cloud's open-source text-to-speech system, released in late November 2025 and updated through January 2026. It goes well beyond simple voice synthesis, offering three distinct modes: voice cloning from just 3 seconds of audio, voice design that creates entirely new voices from text descriptions, and custom voice with 9 pre-built speakers and full emotional control. The system supports 10 languages, runs locally on consumer GPUs with as little as 4 GB of VRAM, and is released under the Apache 2.0 license.

What makes Qwen3-TTS stand out in a crowded TTS landscape isn't any single feature; it's the combination of all three generation modes in one open-source family, with quality that community testers consistently describe as competitive with or exceeding commercial services like ElevenLabs. The multi-codebook tokenizer preserves acoustic details that other models lose (laughs, sighs, breathing patterns), and the dual-track architecture achieves a 97 ms first-packet latency for real-time applications. For an overview of the broader ecosystem, see the Qwen 3 family page.

Figure: Qwen3-TTS's four core capabilities: Clone (voice cloning from reference audio), Design (new voices from text prompts), Control (emotion and speaking-style adjustment), and Smart (complex text such as equations and mixed scripts).

Model Variants

The Qwen3-TTS family consists of multiple specialized models across two sizes:

| Model | Parameters | Mode | VRAM |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B (Base) | 1.7 billion | Voice cloning | ~5–6 GB |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7 billion | 9 pre-built voices + emotion control | ~5–6 GB |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 billion | Create voices from text prompts | ~5–6 GB |
| Qwen3-TTS-12Hz-0.6B (Base) | 0.6 billion | Voice cloning (lighter) | ~3.5–4 GB |
| Qwen3-TTS-12Hz-0.6B-CustomVoice | 0.6 billion | Pre-built voices (lighter) | ~3.5–4 GB |

All models are released under Apache 2.0. The 1.7B variants produce noticeably better quality and expressiveness, while the 0.6B variants are ideal for speed-sensitive applications or limited hardware. The base model files are under 4 GB for the 1.7B and under 2 GB for the 0.6B, making them very accessible.

The Three Modes of Qwen3-TTS

Unlike most TTS systems that offer a single generation approach, Qwen3-TTS provides three fundamentally different ways to generate speech:

1. Voice Cloning (Base Model)

Clone any voice with as little as 3 seconds of reference audio. The model extracts the speaker's timbre, cadence, and acoustic characteristics, then generates new speech in that voice. Community testers report that even 3 seconds can produce recognizable clones, though longer reference clips (8–15 seconds) significantly improve quality.

  • Zero-shot — No training required. Upload audio, type text, generate.
  • Cross-lingual — Clone a voice speaking Spanish and make it speak English, Japanese, or any of the 10 supported languages while preserving the original timbre.
  • X-Vector mode — If you don't want to transcribe the reference audio, enable "X-Vector Only" to extract the voice identity from the audio alone (slightly lower quality than providing the transcript).
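
To make the workflow concrete, here is a minimal sketch of a cloning call. The qwen_tts module, the Qwen3TTS class, and every parameter name below are illustrative assumptions, not the repository's confirmed API; the actual entry points are documented in the QwenLM/Qwen3-TTS repo.

    # Hypothetical sketch: class and parameter names are assumptions, not the real API.
    from qwen_tts import Qwen3TTS  # assumed import path

    model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B")

    # Zero-shot clone: a short reference clip plus its transcript.
    audio = model.generate(
        text="Hello! This is my cloned voice speaking new words.",
        ref_audio="speaker_reference.wav",  # 3+ seconds; 8–15 s improves quality
        ref_text="Transcript of the reference clip goes here.",
    )
    audio.save("cloned_output.wav")

    # X-Vector mode: clone from the audio identity alone, no transcript needed.
    audio = model.generate(
        text="Same voice, no transcript required.",
        ref_audio="speaker_reference.wav",
        x_vector_only=True,  # assumed flag; slightly lower quality than with a transcript
    )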

2. Custom Voice (Pre-built Speakers + Emotion)

Choose from 9 pre-built high-quality voices and control how they speak through text instructions. You can make a voice sound sad, angry, whispering, excited, sarcastic, or any other emotional tone.

| Voice | Native Language | Description |
|---|---|---|
| Aiden | English | Male, clear and versatile |
| Ryan | English | Male, warm and natural |
| Vivien | Chinese | Female, bright and slightly edgy |
| Soji | Japanese | Male, calm and measured |
| Dylan | Korean | Male, articulate |

Plus 4 additional voices covering Chinese dialects.

All voices work across all 10 languages, though they perform best in their native language. Using a Chinese-native voice for English text produces a natural accent effect that can be useful for specific applications.
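
In code, picking a pre-built speaker and steering its delivery might look like the sketch below; the speaker and instruction parameter names are illustrative assumptions, not the confirmed CustomVoice API.

    # Hypothetical sketch: parameter names are assumptions.
    from qwen_tts import Qwen3TTS  # assumed import path

    model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice")
    audio = model.generate(
        text="I cannot believe we actually pulled this off!",
        speaker="Ryan",  # one of the 9 pre-built voices
        instruction="Excited and slightly breathless, with a rising, triumphant intonation.",
    )
    audio.save("ryan_excited.wav")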

3. Voice Design (Text-to-Voice)

This is the most innovative feature — and one that's rarely seen in TTS systems. You can create an entirely new voice from scratch using only a text description. No reference audio needed at all.

You describe the voice you want — gender, age, accent, personality, speaking pace, emotional baseline — and the model generates a voice matching that description. Examples tested by the community include:

  • "Very old man, raspy and weak voice" — produces an elderly, trembling voice
  • "Sassy, flirty female in her 20s, dynamic expressive vocal range" — produces an animated, youthful voice
  • "Middle-aged adult, authoritative, confident and performative" — produces a broadcast-quality voice
  • "Cute cartoon chipmunk voice" — produces a high-pitched animated character voice

The Qwen team provides detailed prompting templates that let you specify gender, pitch, speed, accent, age, background personality, and gradual control over how the voice evolves throughout the text.
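
Putting those template attributes together, a voice-design request might look like this sketch. The attribute categories mirror the documented templates, but the voice_prompt parameter name is an assumption.

    # Hypothetical sketch: the parameter name is an assumption.
    from qwen_tts import Qwen3TTS  # assumed import path

    model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")

    voice_description = (
        "Gender: female. Age: mid-20s. Pitch: medium-high. "
        "Speed: fast and energetic. Accent: American English. "
        "Personality: sassy and playful, with a dynamic, expressive vocal range."
    )
    audio = model.generate(
        text="Oh please, you did not just say that.",
        voice_prompt=voice_description,
    )
    audio.save("designed_voice.wav")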

Multi-Speaker Podcast Generation

A unique capability: you can define multiple speakers in a single prompt and have them converse naturally. For example, define "Lucas" (male) and "Mia" (female) with distinct personality traits, then write a dialogue transcript. Qwen3-TTS generates the full conversation with voice-switching, making it possible to create podcast-style audio from a single generation.
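
The exact speaker-tagging syntax comes from the official prompt templates; the hedged sketch below only illustrates the idea of declaring two speakers and their dialogue in a single generation.

    # Hypothetical transcript layout and API: all names are assumptions.
    from qwen_tts import Qwen3TTS  # assumed import path

    model = Qwen3TTS.from_pretrained("Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign")

    dialogue = """\
    Lucas (male, deep voice, dry humor): Welcome back to the show, Mia.
    Mia (female, upbeat, fast-paced): Thanks, Lucas! Today we're covering open-source TTS.
    Lucas: And for once, the hype might actually be deserved.
    """
    audio = model.generate(text=dialogue)  # one call, voices switch line by line
    audio.save("podcast_episode.wav")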

Architecture: Dual-Track Multi-Codebook

Qwen3-TTS introduces two key architectural innovations that set it apart from traditional TTS systems:

Multi-Codebook Tokenizer (12Hz)

Traditional audio codecs compress audio into a single stream, losing fine details like breathing patterns, laughter, and acoustic environment. Qwen3-TTS uses a multi-codebook tokenizer that maintains multiple parallel representations of the same audio. This preserves both high-level semantic information and fine acoustic details simultaneously.
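
A toy illustration of the idea (not the real tokenizer): at 12 Hz, each second of audio becomes 12 frames, and every frame carries one token per codebook, so several parallel streams describe the same instant of sound.

    # Conceptual toy model of multi-codebook tokenization; all numbers are illustrative.
    import numpy as np

    frame_rate = 12    # frames per second (the "12Hz" in the model names)
    num_codebooks = 4  # illustrative; the real codebook count is model-specific
    vocab_size = 1024  # illustrative codebook size
    seconds = 2.0

    # One row per codebook, one column per 1/12-second frame.
    tokens = np.random.randint(0, vocab_size, size=(num_codebooks, int(frame_rate * seconds)))
    print(tokens.shape)  # (4, 24): 24 frames, each described by 4 parallel codes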

The results speak for themselves: the tokenizer achieves a PESQ score of 3.21 (wideband, broadcast-quality), an STOI of 0.96 (near-perfect intelligibility), and a speaker similarity (SIM) of 0.95 — meaning almost zero identity loss during synthesis.

Figure: Qwen-TTS-Tokenizer-12Hz leads SpeechTokenizer, X-codec, X-codec 2, XY-Tokenizer, and Mimi across all speech-reconstruction metrics: PESQ, STOI, UTMOS, and speaker similarity (SIM).

Dual-Track Architecture

Traditional TTS uses two stages: a language model predicts tokens, then a diffusion model generates audio. Qwen3-TTS replaces this with a dual-track architecture using discrete multi-codebook modeling that processes multiple streams in parallel. The practical benefit is latency: audio frames can be decoded as soon as their tokens are predicted, which is how the system reaches the 97 ms first-packet latency cited above and supports real-time streaming.
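
A rough sketch of why this matters, assuming a hypothetical streaming interface (generate_stream and decode are illustrative names, not the repository's API):

    # Hypothetical streaming loop: method names are assumptions.
    def stream_speech(model, text):
        for frame_tokens in model.generate_stream(text):  # tokens arrive frame by frame
            chunk = model.decode(frame_tokens)  # roughly 1/12 s of audio per frame
            yield chunk  # the first chunk ships in ~97 ms, per the reported latency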

Benchmarks & Performance

Qwen3-TTS-1.7B achieves competitive or leading results across voice cloning, voice design, and custom voice benchmarks:

Figure: Qwen3-TTS-12Hz-1.7B compared against Qwen3-Omni-30B, MiniMax-Speech, ElevenLabs, GPT-4o, CosyVoice3, and Mimo-Audio on Voice Clone, Voice Design, and Custom Voice benchmarks (lower is better for clone metrics; higher is better for APS/DSD).

Key Benchmark Highlights

| Category | Metric | Qwen3-TTS-1.7B | Best Competitor |
|---|---|---|---|
| Voice Clone (Seed-test zh/en) | WER (lower = better) | 0.77 / 1.24 | 0.83 / 1.65 (MiniMax) |
| Voice Clone (multilingual) | Content accuracy | 1.835 | 1.906 (Qwen3-Omni) |
| Voice Clone (cross-lingual) | Quality score | 4.418 | 4.623 (Qwen3-Omni) |
| Voice Design | APS / DSD (higher = better) | 84.1 / 81.8 | 82.3 / 81.6 (MiniMax) |
| Custom Voice (instruction) | Eval score (higher = better) | 75.4 | 87.1 (Gemini-pro) |
| Speaker Similarity | SIM (higher = better) | 0.95 | |

Qwen3-TTS leads in voice cloning accuracy and voice design quality. On instruction-following for custom voices, commercial models like Gemini-pro still hold an edge, but the gap is narrowing rapidly — and Qwen3-TTS is the only fully open-source option competing at this level.

Figure: the 0.6B variant's Voice Clone and Custom Voice benchmarks stay competitive with the same set of rivals while running on minimal hardware.

Supported Languages

Qwen3-TTS supports 10 major languages:

Primary languages: Chinese (zh), English (en), Japanese (ja), Korean (ko), Spanish (es)
Additional languages: German (de), French (fr), Russian (ru), Portuguese (pt), Italian (it)

The model handles cross-lingual generation well — you can clone a voice that speaks Spanish and have it output English while maintaining the original timbre and character. Community testers have also noted that accented voices transfer naturally across languages, producing authentic-sounding accented speech.

How to Run Qwen3-TTS Locally

There are multiple ways to run Qwen3-TTS depending on your technical level:

Option 1: Official Web UI (Python)

The most direct approach using the official repository.

  1. Clone the repository:
    git clone https://github.com/QwenLM/Qwen3-TTS
    cd Qwen3-TTS
    pip install -r requirements.txt
  2. Download models from Hugging Face:
    huggingface-cli download Qwen/Qwen3-TTS-12Hz-1.7B
  3. Launch the demo for each mode:
    # Voice cloning (base model)
    python qwen-tts-demo.py --model Qwen3-TTS-12Hz-1.7B --port 8000

    # Custom voice
    python qwen-tts-demo.py --model Qwen3-TTS-12Hz-1.7B-CustomVoice --port 8001

    # Voice design
    python qwen-tts-demo.py --model Qwen3-TTS-12Hz-1.7B-VoiceDesign --port 8002
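
If you prefer scripting the model downloads from step 2 instead of using huggingface-cli, the standard huggingface_hub Python API works as well (assuming the CustomVoice and VoiceDesign checkpoints follow the naming from the variants table above):

    # Scripted download via huggingface_hub (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    for repo_id in [
        "Qwen/Qwen3-TTS-12Hz-1.7B",
        "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
        "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    ]:
        snapshot_download(repo_id=repo_id)  # cached under ~/.cache/huggingface by default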

Option 2: ComfyUI (Graphical Interface)

The easiest option for users who prefer a visual, node-based workflow. Requires an existing ComfyUI installation.

  1. Navigate to ComfyUI/custom_nodes/ and clone the Qwen TTS workflow repository.
  2. Install requirements from the new folder.
  3. Restart ComfyUI and find the Qwen3-TTS template under Templates.
  4. Models download automatically on first run (~6 GB total).

The ComfyUI workflow provides separate nodes for voice cloning, custom voice, and voice design — all accessible in a single interface. Generation takes approximately 15–20 seconds for a typical sentence on consumer hardware.

Option 3: Google Colab (Free, No GPU Required)

If you don't have a dedicated GPU, you can run Qwen3-TTS for free using Google Colab with a T4 GPU.

  1. Open the community-provided Colab notebook (linked from the Qwen3-TTS repository).
  2. Select GPU runtime (T4 is free).
  3. Run the installation cells and start generating.

Inference takes slightly longer on Colab (~25 seconds per sentence with the 1.7B model) but it's completely free and requires zero local setup.

Hardware Requirements

| Model | Model Size | VRAM (Inference) | Speed (Typical Sentence) |
|---|---|---|---|
| 1.7B variants | ~4 GB | ~5–6 GB | ~15–20 seconds |
| 0.6B variants | ~2 GB | ~3.5–4 GB | ~10–15 seconds |

The models are remarkably lightweight. The 0.6B variant can run on virtually any modern GPU (even a GTX 1660 with 6 GB VRAM), while the 1.7B variant fits comfortably on cards like the RTX 3060 or higher. Generation speed depends on text length and hardware, but typical sentences complete in 10–20 seconds on consumer GPUs.
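
A quick way to choose between the two sizes is to check available VRAM programmatically. This snippet uses standard PyTorch calls and the thresholds from the table above:

    # Suggest a variant based on detected VRAM (requires PyTorch with CUDA support).
    import torch

    if torch.cuda.is_available():
        vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
        variant = "Qwen/Qwen3-TTS-12Hz-1.7B" if vram_gb >= 6 else "Qwen/Qwen3-TTS-12Hz-0.6B"
        print(f"Detected {vram_gb:.1f} GB VRAM -> suggested model: {variant}")
    else:
        print("No CUDA GPU detected; consider the free Colab route instead.")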

For general guidance on running Qwen models locally, see our local deployment guide and hardware requirements page.

Fine-Tuning Qwen3-TTS

Qwen3-TTS supports single-speaker fine-tuning to create a highly personalized voice model. This is useful when zero-shot cloning doesn't capture enough detail, or when you need consistent voice output across many generations.

What You Need

  • Clean recordings of the target speaker, sliced into segments of under 20 seconds each
  • A transcript for every segment (a dataset maker tool can generate these)
  • The Qwen3-TTS tokenizer, downloaded separately for fine-tuning
  • A GPU with enough VRAM for training (lower the batch size to fit smaller cards)

Process Overview

  1. Prepare your dataset — Use a dataset maker tool to transcribe audio files, slice them into segments (recommended: under 20 seconds each), and export in Qwen3-TTS format (see the manifest sketch after this list).
  2. Download the tokenizer — The Qwen3-TTS tokenizer is required separately for fine-tuning.
  3. Configure training — Set batch size (lower for less VRAM), number of epochs (10+ recommended), and learning rate.
  4. Train — Run the training script. Checkpoints are saved at configurable intervals.
  5. Test — Launch the demo with your trained checkpoint to evaluate quality.
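
The exact dataset format is defined by the Qwen3-TTS training scripts; as a hedged sketch, a segment-per-line manifest pairing each sliced clip with its transcript might look like this (field names are assumptions, not the confirmed schema):

    # Hypothetical manifest layout; field names are not the confirmed schema.
    import json

    segments = [
        {"audio": "clips/seg_001.wav", "text": "First transcribed segment."},
        {"audio": "clips/seg_002.wav", "text": "Second segment, kept under 20 seconds."},
    ]
    with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
        for seg in segments:
            f.write(json.dumps(seg, ensure_ascii=False) + "\n")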

Important Tips from the Community

  • Longer reference clips (8–15 seconds) noticeably outperform the 3-second minimum for cloning quality.
  • Keep training segments under 20 seconds and train for at least 10 epochs, comparing saved checkpoints as you go.
  • If VRAM allows, fine-tune the 1.7B base rather than the 0.6B: it is noticeably more expressive, and the 0.6B's emotion instruction-following can be inconsistent.

Qwen3-TTS vs. Commercial Alternatives

| Feature | Qwen3-TTS | ElevenLabs | GPT-4o Audio |
|---|---|---|---|
| Open-source | Yes (Apache 2.0) | No | No |
| Cost | Free (self-hosted) | $5–$99/month | Pay per token |
| Voice cloning | 3-second zero-shot | Yes | No |
| Voice design (text-to-voice) | Yes (unique feature) | No | No |
| Emotion control | Text-based prompting | Style presets | Limited |
| Languages | 10 | 29+ | 50+ |
| Offline/local | Yes | No | No |
| Fine-tuning | Yes (single speaker) | No | No |
| Multi-speaker podcast | Yes (single generation) | Limited | No |
| Min VRAM | ~4 GB (0.6B model) | N/A (cloud) | N/A (cloud) |

ElevenLabs still leads in language count and has a more polished API, but Qwen3-TTS offers something commercial services cannot: full local control, zero recurring costs, fine-tuning capability, and the unique voice design mode. For developers and content creators who need flexibility and privacy, Qwen3-TTS is a compelling alternative. For comparisons of Qwen models in other domains, see our Qwen vs. ChatGPT page.

Final Verdict

Qwen3-TTS is arguably the most feature-complete open-source TTS system available today. The combination of voice cloning, voice design, and emotion-controlled custom voices in a single family, all running on consumer hardware, is unprecedented. The quality consistently impresses testers, and the voice design capability is genuinely novel.

It's not perfect: language support is limited to 10 languages (ElevenLabs offers 29+), instruction-following for emotions can be inconsistent on the 0.6B model, and the voice design mode requires experimentation with prompts to get optimal results. But these are minor caveats for a system that is free, open-source, and runs locally.

For content creators looking to generate multilingual voiceovers, developers building voice-enabled applications, or anyone who wants unlimited TTS without monthly subscriptions — Qwen3-TTS delivers serious value.

For the companion speech recognition models, see our Qwen3-ASR guide. Explore the full Qwen 3 family, try Qwen AI Chat, or check our guide to running Qwen models locally.