Run Qwen with llama.cpp
If you want the absolute fastest local Qwen inference on NVIDIA hardware, llama.cpp is it. Not Ollama, not vLLM, not a Python wrapper — raw llama.cpp with the right flags will push 2-5x more tokens per second than Ollama on the same GPU running the same model. On an RTX 3090 with Qwen3.5-35B-A3B, that's the difference between 15-20 tok/s and 100 tok/s.
The tradeoff is setup. Ollama gives you a one-liner. llama.cpp asks you to pick quantization formats, set cache types, and possibly build from source. This guide walks through all of it — install, configuration, quantization, performance tuning, and the ik_llama.cpp fork that makes Qwen 3.5 27B actually usable for multi-turn conversations.
Not sure your GPU can handle it? Check our Can I Run Qwen tool first. If you'd rather skip the setup entirely, the Ollama guide gets you running in one command. Mac users should also look at MLX, which outperforms llama.cpp on Apple Silicon.
Three Ways to Install llama.cpp
Option 1: Build from Source (Recommended)
Building from source gives you the latest optimizations — and for Qwen models specifically, recent commits have delivered ~2.5x inference speedups that haven't reached all package managers yet. If you have an NVIDIA GPU, this is the path that unlocks full CUDA acceleration.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
For Mac with Metal (Apple Silicon):
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release -j $(sysctl -n hw.ncpu)
The binaries land in build/bin/. The two you'll use most: llama-cli (interactive chat) and llama-server (OpenAI-compatible API).
Option 2: Homebrew (macOS/Linux)
The fastest install, courtesy of llama.cpp creator Georgi Gerganov himself:
brew install llama.cpp
One line and you're done. The downside: Homebrew builds may lag behind source by days or weeks, meaning you could miss performance patches that matter. For casual use, it's fine. For squeezing every token per second, build from source.
Option 3: Pre-built Binaries
GitHub Releases (github.com/ggml-org/llama.cpp/releases) ships pre-compiled binaries for Windows, Linux, and macOS. Grab the one matching your platform and GPU. Windows users: download the cuda12 variant if you have an NVIDIA card.
Quick Start: Running Qwen in 30 Seconds
llama.cpp can download GGUF models directly from Hugging Face. No manual download required.
Interactive Chat
llama-cli -hf Qwen/Qwen3-8B-GGUF:Q8_0 --jinja -c 8192 -ngl 99 -fa --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 -cnv
This pulls the Q8_0 quantization of Qwen3-8B, offloads all layers to GPU (-ngl 99), enables flash attention (-fa), and starts an interactive conversation. The --jinja flag activates the model's chat template — skip it and you'll get garbled output.
Server Mode (OpenAI-Compatible API)
llama-server -hf Qwen/Qwen3.5-9B-GGUF:Q4_K_M --jinja -c 8192 -ngl 99 -fa --reasoning-format deepseek --host 0.0.0.0 --port 8080
This spins up a server at localhost:8080 with a built-in web UI and an OpenAI-compatible API at /v1/. The --reasoning-format deepseek flag properly handles Qwen's thinking tokens in the Qwen 3.5 models.
Homebrew One-liner (Georgi Gerganov's Pick)
llama-server --fim-qwen-30b-default
This preset auto-downloads Qwen3-30B-A3B with fill-in-the-middle support — designed for code completion. Straight from the llama.cpp creator.
Key Flags Reference
llama.cpp has dozens of flags. These are the ones that actually matter for Qwen models, grouped by what they do.
| Flag | What It Does | Recommended Value |
|---|---|---|
| --jinja | Activates chat template. Required for proper Qwen output. | Always on |
| -ngl N | Offload N layers to GPU. Use 99 or 999 to offload everything. | 99 |
| -c N | Context size in tokens. Higher = more VRAM. | 8192 (start here) |
| -fa | Flash attention. Cuts VRAM usage and speeds up long contexts. | Always on |
| --fit | Auto-split model between GPU and CPU based on available VRAM. | On (if model doesn't fit GPU) |
| --cache-type-k | KV cache key quantization. Major speed boost. | q8_0 |
| --cache-type-v | KV cache value quantization. | q8_0 |
| --temp | Sampling temperature. Lower = more focused. | 0.6 |
| --top-k | Top-K sampling. | 20 |
| --top-p | Nucleus sampling threshold. | 0.95 |
| --min-p | Minimum probability filter. Set to 0 for Qwen's default. | 0 |
| --no-context-shift | Prevents automatic context truncation when hitting the limit. | On for long conversations |
| --reasoning-format | Handles thinking tokens properly in server mode. | deepseek |
| -np N | Number of parallel slots in server mode. | 1 (for single-user) |
| --threads N | CPU threads for inference. | Half your cores |
MoE-Specific: Offloading Expert Layers to CPU
Qwen 3.5's MoE models (35B-A3B, 122B-A10B) have massive expert layers that eat VRAM. If your GPU can't hold the full model, you can selectively push just the expert layers to CPU while keeping everything else on the GPU:
-ot ".ffn_.*_exps.=CPU"
This is more surgical than --fit, which splits layers uniformly. With MoE, the expert tensors are disproportionately large, so targeting them specifically keeps attention and routing on the GPU where speed matters most.
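If you want to sanity-check which tensors a -ot pattern will catch before launching, you can test the regex directly. A quick sketch in Python — the tensor names below are illustrative examples of llama.cpp's blk.N.* GGUF naming convention, not dumped from a real model:

```python
import re

# The tensor-name pattern from the -ot flag (the part before "=CPU")
pattern = re.compile(r".ffn_.*_exps.")

# Hypothetical GGUF tensor names following llama.cpp's blk.N.* convention
tensors = [
    "blk.0.ffn_gate_exps.weight",  # MoE expert tensor -> offloaded to CPU
    "blk.0.ffn_down_exps.weight",  # MoE expert tensor -> offloaded to CPU
    "blk.0.attn_q.weight",         # attention -> stays on GPU
    "blk.0.ffn_norm.weight",       # shared norm -> stays on GPU
]

for name in tensors:
    target = "CPU" if pattern.search(name) else "GPU"
    print(f"{name} -> {target}")
```

Only the _exps tensors match, so attention and routing weights stay GPU-resident.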
Quantization Guide: Which GGUF Format to Pick
GGUF quantization controls the tradeoff between model quality, file size, and VRAM usage. The naming convention is straightforward: Q8 = 8-bit, Q4 = 4-bit, K_M = medium quality variant. Lower bits = smaller and faster, but dumber.
| Format | Quality | Size vs FP16 | Best For |
|---|---|---|---|
| Q8_0 | Near-lossless | ~50% | Maximum quality when VRAM allows |
| Q5_K_M | Excellent | ~35% | Coding tasks — preserves instruction-following |
| Q4_K_M | Very good | ~28% | General use — the default sweet spot |
| Q3_K_M | Good | ~22% | Tight VRAM budgets |
| IQ4_NL | Good (imatrix) | ~28% | Better quality at 4-bit with calibration data |
Start with Q4_K_M. It's the community default for a reason — the quality loss is minimal for general conversation and reasoning. If you're using Qwen for code generation with Qwen Coder, step up to Q5_K_M. The extra bit preserves the precision that matters for syntax and logic. Q8_0 is for purists with the VRAM to spare.
For aggressive quantization below Q4, consider imatrix-calibrated quants. Running llama-imatrix on a representative dataset before quantizing produces significantly better results at Q3 and below — the importance matrix tells the quantizer which weights to preserve.
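The size ratios in the table above fall out of simple bits-per-weight arithmetic. A rough sketch — the bits-per-weight figures are approximate averages (GGUF formats mix quantized blocks with scale factors), not exact per-model values:

```python
# Approximate average bits per weight for common GGUF formats (rough figures)
BPW = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q3_K_M": 3.9}

def gguf_size_gb(params_billions: float, fmt: str) -> float:
    """Estimated weight-file size in GB: params x bits-per-weight / 8."""
    return params_billions * BPW[fmt] / 8

for fmt in ("Q8_0", "Q5_K_M", "Q4_K_M"):
    ratio = BPW[fmt] / BPW["F16"]
    print(f"{fmt}: ~{gguf_size_gb(27, fmt):.0f} GB for a 27B model "
          f"({ratio:.0%} of FP16)")
```

For a 27B dense model this lands near the VRAM table's figures (weights only — the KV cache comes on top).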
VRAM Requirements by Model and Quantization
| Model | Q4_K_M | Q5_K_M | Q8_0 |
|---|---|---|---|
| Qwen3.5-9B | ~6 GB | ~7 GB | ~10 GB |
| Qwen3.5-27B | ~16 GB | ~20 GB | ~30 GB |
| Qwen3.5-35B-A3B | ~18 GB | ~22 GB | ~35 GB |
| Qwen3.5-122B-A10B | ~60 GB | ~75 GB | ~120 GB |
These numbers include some overhead for KV cache at 8K context. Increase context length and VRAM goes up — roughly 0.5-1 GB extra per 4K tokens on larger models. The MoE models (35B-A3B, 122B-A10B) have misleadingly high total parameter counts but only activate a fraction per token, so their actual inference speed is much faster than the VRAM footprint suggests.
Real-World Performance: llama.cpp vs Ollama
The speed gap between llama.cpp and Ollama is not subtle. On an RTX 3090 running Qwen3.5-35B-A3B with Q4_K_M quantization, community benchmarks show:
| Backend | Model | GPU | Speed |
|---|---|---|---|
| llama.cpp (with cache flags) | Qwen3.5-35B-A3B Q4 | RTX 3090 | ~100 tok/s |
| llama.cpp (default flags) | Qwen3.5-35B-A3B Q4 | RTX 3090 | ~50 tok/s |
| Ollama | Qwen3.5-35B-A3B Q4 | RTX 3090 | ~15-20 tok/s |
That's a 5x difference just by switching backends and adding two flags. The cache type flags alone are responsible for doubling speed from 50 to 100 tok/s, as reported by community member @sudoingX:
--cache-type-k q8_0 --cache-type-v q8_0 -np 1
These flags quantize the KV cache during inference, which cuts memory bandwidth pressure without measurable quality loss. There's no reason not to use them.
Performance by GPU Tier
| GPU | Model (Q4_K_M) | Speed (tok/s) | Notes |
|---|---|---|---|
| RTX 4090 (24 GB) | Qwen3.5-9B | 80-120 | Fully GPU-resident, flash attention on |
| RTX 3090 (24 GB) | Qwen3.5-35B-A3B | 50-100 | MoE sweet spot — 100 with cache flags |
| RTX 3090 (24 GB) | Qwen3.5-27B Dense | ~35 | All 27B parameters active every token |
| RTX 3060 (12 GB) | Qwen3.5-9B | 40-60 | Q4_K_M fits comfortably |
| Mac M4 Max | Qwen3.5-9B | 40-50 | GGUF via Metal — consider MLX instead |
A recent llama.cpp update delivered a ~2.5x inference speedup specifically for Qwen's architecture. If you built from source more than a few weeks ago, rebuild — the difference is real.
Known Issue: Full Re-processing on Qwen 3.5 27B
Qwen 3.5 27B uses a hybrid attention architecture (standard attention + Mamba2/SSM layers). This means llama.cpp can't cache KV state between conversation turns — it re-processes the entire prompt from scratch every time you send a message. With a 4K context, that's a few seconds of delay. At 16K+, it becomes painful.
This is a fundamental architectural limitation, not a bug. The 9B and MoE models don't have this issue. If you need the 27B specifically, the ik_llama.cpp fork below makes it bearable.
ik_llama.cpp: 26x Faster Prompt Processing
This is the section nobody else has documented properly, and it matters enormously if you're running Qwen 3.5 27B.
Developer ikawrakow maintains a fork of llama.cpp with aggressive CUDA kernel fusion optimizations. The headline number: prompt processing goes from 43 tok/s to 1,122 tok/s on Qwen 3.5 27B Q4_K_M — a 26x speedup. Tested on an RTX PRO 4000 Blackwell with a Xeon W-2295.
Why does prompt processing speed matter so much? Because of the re-processing issue described above. Every turn in a conversation with the 27B model re-processes the full context. At 43 tok/s, a 4K context takes ~93 seconds to re-process. At 1,122 tok/s, it takes ~3.6 seconds. That's the difference between "unusable" and "fast enough for real-time work."
How to Build ik_llama.cpp
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
The build process is identical to mainline llama.cpp. The fork is fully compatible with standard GGUF files — you don't need to re-download or re-quantize anything. Just point it at the same model files you already have.
Running It
# GPU mode (recommended)
./build/bin/llama-server \
--model /path/to/Qwen3.5-27B-Q4_K_M.gguf \
--ctx-size 4096 -ngl 999
# CPU-only mode
./build/bin/llama-server \
--model /path/to/Qwen3.5-27B-Q4_K_M.gguf \
--ctx-size 4096
Open http://127.0.0.1:8080 in your browser and you'll get the same web UI as mainline llama.cpp.
What Else ik_llama.cpp Does Better
Beyond kernel fusion, the fork includes SOTA quantization types that go beyond what mainline llama.cpp offers, plus improved CPU inference paths. The developer is active and merges upstream changes regularly, so you're not trading stability for speed.
Our recommendation: If you run Qwen 3.5 27B regularly on NVIDIA hardware, use ik_llama.cpp. For other Qwen models (9B, MoE variants), mainline llama.cpp is fine — the re-processing issue doesn't affect them, so the 26x prompt speedup won't matter.
Server Mode and OpenAI-Compatible API
llama.cpp's server mode turns any Qwen model into a local API endpoint that speaks the same protocol as OpenAI's API. Any tool that works with GPT-4 — coding agents, chatbots, RAG pipelines — can point at your local llama.cpp server instead.
llama-server -hf Qwen/Qwen3.5-9B-GGUF:Q4_K_M --jinja -ngl 99 -fa \
--cache-type-k q8_0 --cache-type-v q8_0 \
--host 0.0.0.0 --port 8080
This exposes three things: a web UI at http://localhost:8080, a chat completions API at /v1/chat/completions, and a completions API at /v1/completions. The web UI is surprisingly good for testing prompts.
Python Client Example
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
model="qwen3.5",
messages=[{"role": "user", "content": "Explain MoE architectures in 3 sentences."}]
)
print(response.choices[0].message.content)
The api_key can be anything — the local server doesn't authenticate. The model field is also ignored since llama.cpp serves whichever model you loaded at startup, but the OpenAI client library requires it.
From the Community
Here's what developers running Qwen with llama.cpp are reporting:
first impressions of qwen 3.5 27B dense on a single RTX 3090. 35 tok/s. from 4K all the way to 300K+ context. no speed drop. MoE held 112 flat. 3x faster but only 3B of 35B active per token.
— @sudoingX March 2026
The flat speed across context lengths is noteworthy. For most transformer-based models, speed degrades as context grows. Qwen 3.5's hybrid architecture with linear attention layers avoids that penalty almost entirely — a practical benefit you won't find in benchmark tables.
PSA: Ollama and LM Studio won't load Qwen 3.5-35B-A3B yet, the MoE architecture isn't supported in standard loaders. What works: llama.cpp built from source + Unsloth GGUF quants (Q4_K_XL, 18GB).
— @MrE_Btc March 2026
Compatibility is a real issue with cutting-edge MoE models. When new Qwen architectures drop, llama.cpp built from source is usually the first — and sometimes only — way to run them locally. If you hit loading errors in other tools, this is why.
FAQ
Should I Build from Source or Use Homebrew?
Build from source if you have an NVIDIA GPU and want peak performance. The CUDA build path is where the speed optimizations live. Homebrew is fine for Mac users doing casual testing — you can always switch to a source build later.
What Quantization Should I Use?
Q4_K_M for general use. Step up to Q5_K_M if you're doing coding tasks where instruction precision matters. Q8_0 only if you have the VRAM to spare and want near-lossless quality. See the full quantization table above.
Can I Run Qwen on CPU Only?
Yes, but slowly. Remove the -ngl flag and set --threads to half your CPU core count. Expect single-digit tok/s for anything above 8B parameters. CPU-only is viable for testing prompts, not for production use. Check our local deployment hub for more options.
How Do I Split a Model Between GPU and CPU?
Two approaches. The --fit flag auto-detects available VRAM and splits layers accordingly — easiest option. For MoE models specifically, use -ot ".ffn_.*_exps.=CPU" to push only the expert layers to CPU while keeping attention on the GPU. The second approach is more efficient because MoE expert tensors are disproportionately large.
Why Is Ollama So Much Slower?
Ollama adds abstraction layers for ease of use — model management, automatic template detection, a simpler CLI. Those layers cost performance. Ollama also doesn't expose cache type flags or flash attention toggles that make the biggest difference. For users who prioritize convenience over speed, Ollama is still a solid choice. For maximum performance, llama.cpp is the tool.
Do I Need ik_llama.cpp?
Only if you're running Qwen 3.5 27B specifically and doing multi-turn conversations. The 26x prompt processing speedup solves the re-processing bottleneck unique to the 27B's hybrid attention architecture. For the 9B, MoE models (35B-A3B, 122B-A10B), and all Qwen3 models, mainline llama.cpp works great.