Run Qwen with MLX on Apple Silicon

MLX is the fastest way to run Qwen on any Mac with Apple Silicon. Built by Apple's own machine learning team, the framework talks directly to the Metal GPU and unified memory architecture that makes M-series chips unique. The result: roughly 2x faster token generation and 3-5x faster prompt processing compared to Ollama, with about half the memory footprint.

That's not marketing. Community benchmarks consistently show MLX pulling 60-70 tokens per second on an M4 Max running Qwen 3.5 35B-A3B at 4-bit quantization. Ollama on the same hardware tops out around 35. If you own a Mac with M1 or newer, MLX is the recommended way to run Qwen locally — and this guide covers everything from installation to real-world performance numbers across every M-series chip.

Not on a Mac? Check our complete guide to running Qwen locally for NVIDIA, AMD, and CPU options, or use the Can I Run Qwen tool to see what fits your hardware.

MLX — Apple's framework for fast ML inference on Apple Silicon

How to Install MLX for Qwen

You need three things: a Mac with Apple Silicon (M1 or newer), Python 3.10+, and a virtual environment. The whole setup takes under a minute.

python3 -m venv mlx-qwen && source mlx-qwen/bin/activate
pip install mlx-lm

That's it. mlx-lm handles text generation for all Qwen language models. If you also want vision capabilities (image and video understanding), add the companion package:

pip install mlx-vlm

No CUDA drivers, no container runtimes, no configuration files. MLX compiles Metal shaders on first run and caches them — expect a brief pause the first time you load a model, then near-instant startup after that.

Qwen MLX Models on HuggingFace

The mlx-community on HuggingFace maintains pre-quantized Qwen models optimized for Apple Silicon. These are 4-bit quantizations that balance quality and speed — the sweet spot for local inference on Mac hardware.

| Model | HuggingFace ID | Min RAM | Best For |
|---|---|---|---|
| Qwen3.5 0.8B | mlx-community/Qwen3.5-0.8B-MLX-4bit | 8GB | Quick tasks, drafts, testing pipelines |
| Qwen3.5 2B | mlx-community/Qwen3.5-2B-MLX-4bit | 8GB | Chat, summarization, light coding |
| Qwen3.5 4B | mlx-community/Qwen3.5-4B-MLX-4bit | 8GB | General-purpose, daily driver on base M1/M2 |
| Qwen3.5 9B | mlx-community/Qwen3.5-9B-MLX-4bit | 8GB | Strong reasoning, the sweet spot for most users |
| Qwen3.5 27B | mlx-community/Qwen3.5-27B-MLX-4bit | 32GB | Near-frontier quality, demanding tasks |
| Qwen3.5 35B-A3B | mlx-community/Qwen3.5-35B-A3B-MLX-4bit | 16GB | Best speed-to-quality ratio (MoE, only 3B active) |

The 35B-A3B deserves special attention. It's a Mixture-of-Experts model with 35 billion total parameters, but only activates 3 billion per token. That means it runs nearly as fast as the 4B model while delivering intelligence closer to the 27B. On Macs with 16GB+ unified memory, it's the standout pick. For a deeper look at how this model compares across benchmarks, see our Qwen 3.5 overview.
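The memory floors in the table above follow from a simple rule of thumb: 4-bit weights occupy about half a byte per parameter, plus runtime overhead. A quick sketch (the 20% overhead factor is an assumption for quantization scales, KV cache, and buffers, not an MLX-published figure):

```python
def quantized_footprint_gb(params_billions: float, bits: int = 4,
                           overhead: float = 1.2) -> float:
    """Rough memory estimate for a quantized model.

    bits / 8 bytes per parameter, times an assumed ~20% overhead
    for quantization scales, KV cache, and runtime buffers.
    """
    total_bytes = params_billions * 1e9 * (bits / 8) * overhead
    return total_bytes / 2**30

# A dense 9B at 4-bit lands around 5 GB -- comfortable on an 8GB Mac
print(f"9B  @ 4-bit: {quantized_footprint_gb(9):.1f} GB")
# The 27B needs ~15 GB for weights alone, hence the 32GB recommendation
print(f"27B @ 4-bit: {quantized_footprint_gb(27):.1f} GB")
```

Note that an MoE model like the 35B-A3B activates only 3B parameters per token, but all 35B weights must still be resident in unified memory, which is why it needs more RAM than its speed would suggest.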

Running Your First Model

One command gets you generating text:

mlx_lm.generate --model mlx-community/Qwen3.5-9B-MLX-4bit --prompt "Explain quantum entanglement in plain English"

For multi-turn chat with proper formatting, use the Python API with the model's chat template:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-9B-MLX-4bit")

messages = [{"role": "user", "content": "What's the best way to learn Rust?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Recent mlx-lm versions take sampling settings via a sampler object;
# older releases accepted temp/top_p directly on generate().
sampler = make_sampler(temp=0.6, top_p=0.95)
response = generate(model, tokenizer, prompt=prompt, max_tokens=2048, sampler=sampler)
print(response)

The key parameters: a temperature of 0.6 gives coherent but non-robotic output, top-p 0.95 keeps generation focused, and 2048 max tokens is plenty for most responses (--temp, --top-p, and --max-tokens on the mlx_lm.generate CLI). Raise the token limit for long-form generation; Qwen 3.5 models support up to 131K of context.
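If you're curious what apply_chat_template actually emits, Qwen chat models use the ChatML layout. Here's a hand-rolled approximation, for illustration only; in real code, always use the tokenizer's own template, since special tokens can differ between model revisions:

```python
def chatml_prompt(messages: list[dict]) -> str:
    """Approximate the ChatML prompt that Qwen's chat template produces."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
             for m in messages]
    # add_generation_prompt=True opens an assistant turn for the model to fill
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [{"role": "user", "content": "What's the best way to learn Rust?"}]
print(chatml_prompt(messages))
```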

Real-World Speed: M-Series Performance Table

This is the data you won't find anywhere else. Token generation speed on Apple Silicon depends primarily on memory bandwidth, not chip generation. On larger models, an M3 Max with 400 GB/s of bandwidth can outperform an M4 Pro with 273 GB/s, even though the M4 Pro is the newer chip. The table below reflects real-world community benchmarks, not theoretical maximums.

| Chip | Bandwidth | 9B (4-bit) | 35B-A3B (4-bit) | 27B (4-bit) |
|---|---|---|---|---|
| M1 (8GB) | 68 GB/s | ~30 tok/s | N/A (needs 16GB) | N/A (needs 32GB) |
| M1 Pro (16GB) | 200 GB/s | ~35 tok/s | Tight fit, swapping likely | N/A |
| M2 Max (32GB) | 400 GB/s | ~50 tok/s | ~45 tok/s | ~30 tok/s |
| M2 Ultra (64GB+) | 800 GB/s | ~60 tok/s | ~55 tok/s | ~40 tok/s |
| M3 Max (36GB+) | 400 GB/s | ~55 tok/s | ~50 tok/s | ~35 tok/s |
| M4 Pro (24GB) | 273 GB/s | ~60 tok/s | ~55 tok/s | ~30 tok/s |
| M4 Max (36GB+) | 546 GB/s | ~70 tok/s | ~60-70 tok/s | ~40 tok/s |

Numbers represent approximate token generation speed (tokens/second) for 4-bit quantized models via mlx-lm. Actual performance varies with prompt length, generation length, and system load. Sources: community benchmarks from r/LocalLLaMA, MLX Discord, and X posts.

A few things jump out. The M4 Max hits 60-70 tok/s on the 35B-A3B — that's a 35-billion-parameter model generating faster than most people can read. The M2 Ultra's 800 GB/s bandwidth makes it the throughput champion for the 27B dense model, though few people own one. And the base M1 with 8GB can still run the 9B at a respectable 30 tok/s, which is perfectly usable for chat.

The 35B-A3B column tells the real story. Because it only activates 3B parameters per token, it runs at speeds you'd expect from a much smaller model — but answers like a much larger one. On an M4 Max, you're getting near-frontier intelligence at 60-70 tok/s. That's the MoE advantage in action.
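The bandwidth-bound behavior is easy to sanity-check with arithmetic. A dense model must stream roughly its full weight set from memory for every generated token, so bandwidth divided by model size gives a hard ceiling on generation speed. The efficiency factor below is an assumption, since real pipelines never hit the theoretical peak:

```python
def est_tokens_per_sec(bandwidth_gbs: float, model_gb: float,
                       efficiency: float = 0.65) -> float:
    """Bandwidth-bound estimate of dense-model generation speed.

    Each token streams ~all quantized weights through the memory bus
    once; `efficiency` is an assumed fudge factor for real pipelines.
    """
    return bandwidth_gbs / model_gb * efficiency

# M4 Max (546 GB/s) on a ~5 GB 9B 4-bit model: roughly 70 tok/s,
# in line with the community numbers above
print(f"M4 Max, 9B: {est_tokens_per_sec(546, 5.0):.0f} tok/s")
print(f"M2 Max, 9B: {est_tokens_per_sec(400, 5.0):.0f} tok/s")
```

For the MoE 35B-A3B, the relevant number is the much smaller active-expert slice streamed per token rather than the full 35B, which is exactly why it generates at near-4B speeds.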

MLX vs Ollama on Mac: Why the Gap Is So Large

Ollama is the most popular way to run models locally, and it works fine on Mac. But "fine" isn't the same as "optimized." Ollama uses llama.cpp under the hood, which targets GGUF-format models and was built primarily for NVIDIA/CPU hardware. It can use Metal on Mac, but it doesn't exploit Apple Silicon the way a native framework does.

MLX was built from scratch for unified memory. It doesn't copy tensors between CPU and GPU — they share the same memory space, eliminating a bottleneck that GGUF-based tools can't fully avoid. The speed difference shows up in two places:

| Metric | MLX | Ollama (llama.cpp) | Difference |
|---|---|---|---|
| Token generation | ~60-70 tok/s (M4 Max, 35B-A3B) | ~35 tok/s | ~2x faster |
| Prompt processing | 2-3s for a 10-page doc | 10-15s | 3-5x faster |
| Memory usage | Baseline | ~2x higher | ~50% less with MLX |

The prompt processing gap matters more than it sounds. If you're feeding a long document into a model for summarization or Q&A, waiting 10-15 seconds before the first token appears versus 2-3 seconds is the difference between a tool that feels instant and one that feels sluggish. For agentic workflows where the model processes context repeatedly, MLX's prefill speed turns a frustrating wait into a non-issue.
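To put rough numbers on that: time-to-first-token is approximately prompt length divided by prefill speed. The token count and prefill rates below are illustrative assumptions chosen to reproduce the 2-3s vs 10-15s experience described above:

```python
def time_to_first_token(prompt_tokens: int, prefill_tps: float) -> float:
    """Seconds of prefill before the first generated token appears."""
    return prompt_tokens / prefill_tps

doc_tokens = 5_000  # a ~10-page document, rough assumption

# Assumed prefill rates consistent with a 3-5x gap
print(f"MLX:    {time_to_first_token(doc_tokens, 2_000):.1f}s")  # 2.5s
print(f"Ollama: {time_to_first_token(doc_tokens, 400):.1f}s")    # 12.5s
```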

The tradeoff? MLX only runs on Apple Silicon. If you also use a Linux workstation with NVIDIA GPUs, Ollama gives you one tool that works everywhere. But if your Mac is your primary development machine, MLX is the clear winner. See our Ollama guide for setup instructions on that side.

Running Qwen Vision Models with MLX

Qwen's multimodal models — the ones that understand images and video alongside text — work on MLX through the mlx-vlm package. Installation is separate from the text-only mlx-lm:

pip install mlx-vlm

Running inference with an image:

mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --prompt "Describe what you see in this image" \
  --image "path/to/image.jpg"

The 2B vision model fits comfortably on any Apple Silicon Mac, but don't expect GPT-4V-level analysis from it. It handles basic image description, OCR, and chart reading well. For more demanding vision tasks, the larger Qwen-VL variants require 16GB+ of unified memory. Qwen 3.5 vision models support both mlx-lm (text-only mode) and mlx-vlm (full vision+text), so you can use the same model weights for both use cases.

Creating Custom Quantizations

The pre-quantized models from mlx-community work well for most users. But if you want a specific quantization level (3-bit for tighter memory, 8-bit for higher quality) or want to quantize a fine-tuned model, MLX makes it straightforward:

mlx_lm.convert --hf-path Qwen/Qwen3.5-9B --mlx-path ./Qwen3.5-9B-4bit/ -q

This downloads the full-precision weights from HuggingFace and converts them to 4-bit MLX format. The -q flag enables quantization. You can specify a different bit depth with --q-bits 8 or --q-bits 3. The conversion takes a few minutes depending on model size and your internet connection — the actual quantization step is fast.

One practical use case: if you're fine-tuning Qwen Coder models with LoRA adapters and want to run them on your Mac, convert the merged model to MLX format and get native Apple Silicon speed rather than going through Ollama's GGUF pipeline.

Known Issues and Limitations

MLX is fast, but it's not without rough edges. Be aware of these before committing to it as your daily driver:

Tool calling degrades after extended conversations. Community testing by @TeksEdge found that quantized MLX models start producing malformed tool calls after 5-10 rounds of function calling. If you're building an agent that relies heavily on tool use, test thoroughly at your expected conversation length. This appears to be a quantization artifact, not an MLX bug — full-precision models don't exhibit the issue.

No built-in server mode. Unlike Ollama, MLX doesn't include an HTTP API server out of the box. If you need an OpenAI-compatible endpoint for your application, use LM Studio — it uses MLX as its backend on Mac, gives you the same speed benefits, and adds a proper API server with chat UI on top. It's the best of both worlds.

Apple Silicon only. There's no NVIDIA support, no AMD support, no Intel fallback. If you have a mixed fleet of machines, you'll need llama.cpp or Ollama for cross-platform compatibility.

Ecosystem maturity. MLX's Python ecosystem is younger than llama.cpp's. Some advanced features — speculative decoding, grammar-constrained generation, batch inference — are either experimental or missing. The framework is evolving quickly (Apple ships updates almost weekly), but if you need production-grade serving features today, consider vLLM or TGI on a Linux machine instead.
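On the server-mode limitation: once LM Studio (or any OpenAI-compatible server) is running, a plain standard-library client is enough. A minimal sketch; localhost:1234 is LM Studio's default port, and the model name here is a placeholder for whatever you've loaded:

```python
import json
from urllib import request

def build_chat_request(model: str, user_message: str,
                       temperature: float = 0.6) -> dict:
    """Assemble an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

payload = build_chat_request("qwen3.5-9b-mlx-4bit", "Ping?")

# POST to the local server (uncomment with LM Studio running):
# req = request.Request(
#     "http://localhost:1234/v1/chat/completions",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# body = json.loads(request.urlopen(req).read())
# print(body["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```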

From the Community

The sentiment across the MLX community is consistent: developers who've tried both Ollama and MLX on the same hardware report that switching to MLX feels like a hardware upgrade. Same Mac, noticeably faster inference.

The availability of Qwen models at multiple quantization levels (4-bit, 6-bit, 8-bit) gives Mac users the same granular speed-vs-quality control that GGUF offers, with MLX's native speed advantage on top.

Frequently Asked Questions

Should I use MLX or Ollama on Mac?

MLX, every time. It's roughly 2x faster for generation, 3-5x faster for prompt processing, and uses about 50% less memory. The only reason to stick with Ollama on Mac is if you need its built-in API server and don't want to use LM Studio as the middleman.

Which Mac do I need to run Qwen with MLX?

Any M1 or newer with at least 8GB unified memory. The base M1 with 8GB can run models up to 9B at 4-bit quantization (~30 tok/s). For the 35B-A3B MoE model, you want 16GB minimum — an M2 Pro or M3 Pro works. For the 27B dense model, 32GB+ is required. Check the performance table above for specific chip speeds, or use our Can I Run Qwen tool for personalized hardware recommendations.

Can I use MLX as an API server?

Not directly — mlx-lm doesn't include a built-in HTTP server. Your best option is LM Studio, which uses MLX as its inference backend on Mac and exposes an OpenAI-compatible API endpoint. You get MLX speed with proper server functionality, no workarounds needed.

Is MLX quantization quality the same as GGUF?

At the same bit depth (4-bit), quality is comparable. Both use similar quantization techniques and the perplexity difference is negligible for practical use. The advantage of MLX isn't in quantization quality — it's in inference speed on Apple Silicon. Same model quality, significantly faster execution.

Does MLX support Qwen's thinking mode?

Yes. Qwen 3.5 models with built-in thinking/reasoning work with MLX. Set the appropriate chat template and the model will use chain-of-thought reasoning just like it does through any other backend. The thinking tokens are generated at the same accelerated speed as regular tokens.