Run Qwen with Ollama

Ollama is the fastest way to get Qwen running on your machine. One command pulls a pre-quantized build of the model and starts serving — no Python environment, no manual GGUF downloads, no config files. For text generation with Qwen 3.5 or any Qwen model, it genuinely takes under 60 seconds from install to first output.

That said, Ollama's Qwen support has real problems right now. Tool calling is broken. Vision models don't work for Qwen 3.5. Speed sits at roughly 15-20% of what llama.cpp delivers on the same hardware. If any of those matter to you, read the known issues section before committing — it's the most important part of this guide.

Quick verdict: If you want the simplest possible setup for text-only Qwen inference and don't need tool calling or vision, Ollama is the right choice. For production workloads, maximum speed, or Apple Silicon optimization, look at llama.cpp or MLX on Mac instead.

Install Ollama on Any Platform

Ollama runs on macOS, Linux, and Windows. Pick your platform and you'll be ready in under a minute.

macOS

brew install ollama

If you don't use Homebrew, grab the installer from ollama.com/download. Both Intel and Apple Silicon are supported natively.

Linux

curl -fsSL https://ollama.com/install.sh | sh

This installs the binary and sets up a systemd service. Works on Ubuntu, Debian, Fedora, Arch, and most mainstream distros. NVIDIA and AMD GPUs are auto-detected.

Windows

winget install Ollama.Ollama

Or download the installer from ollama.com/download. Windows ARM64 has a native build as of 2026, so Surface Pro and Snapdragon laptops work without emulation.

Verify the Installation

Start the server and confirm everything works:

ollama serve

Note that ollama serve blocks the terminal it runs in, so run the next command in a second terminal. (On Linux, the install script sets up a systemd service, so the server is usually already running and you can skip this step.)

ollama list

If ollama list returns an empty table without errors, you're good. Now pull a model.

Available Qwen Models in Ollama

Ollama hosts official GGUF quants for most Qwen models. Every command below pulls the model and starts inference immediately — no extra steps. Not sure which size fits your GPU? Check our Can I Run Qwen? tool before downloading.

Qwen models available through the Ollama library. The ecosystem spans from 0.6B to 235B parameters.

Qwen 3.5 — Small Models (0.8B to 9B)

These are the models most people should start with. The 9B is the sweet spot — it fits comfortably in 8GB VRAM and punches well above its weight on reasoning tasks.

Model | Parameters | VRAM (approx) | Command
qwen3.5:0.8b | 0.8B | ~1 GB | ollama run qwen3.5:0.8b
qwen3.5:2b | 2B | ~2 GB | ollama run qwen3.5:2b
qwen3.5:5b | 5B | ~4 GB | ollama run qwen3.5:5b
qwen3.5:9b | 9B | ~6-7 GB | ollama run qwen3.5:9b

Qwen 3.5 — Large Models (35B and 122B)

The 35B-A3B is a MoE (Mixture of Experts) model — 35 billion total parameters, but only 3 billion active per token. That means it runs surprisingly fast on consumer hardware. The 122B needs serious GPU memory or multi-GPU setups.

Model | Architecture | VRAM (approx) | Command
qwen3.5:35b-a3b | MoE — 3B active | ~20-22 GB | ollama run qwen3.5:35b-a3b
qwen3.5:122b | Dense | ~70-80 GB | ollama run qwen3.5:122b

Heads up: Early Ollama releases couldn't load the 35B MoE at all — the architecture wasn't supported in standard loaders. This has been fixed in recent versions, but if you hit errors, make sure you're on Ollama v0.17.5 or later.
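A quick sketch of why the MoE design matters in practice: per-token compute tracks the active parameters, while VRAM still has to hold every expert. Using the figures from the table above:

```python
# MoE tradeoff for qwen3.5:35b-a3b: compute per token scales with the
# 3B *active* parameters, but memory must still hold all 35B weights.
total_params = 35e9
active_params = 3e9

active_fraction = active_params / total_params
print(f"Weights used per token: {active_fraction:.0%}")  # ~9%
```

That's why the model generates faster than its 35B label suggests, and why the VRAM figure in the table reflects total parameters, not active ones.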

Qwen 3 Models (Previous Generation)

Qwen 3 models are still solid and sometimes more stable in Ollama than the newer 3.5 family. If you need reliable tool calling or just want fewer rough edges, these are worth considering.

Model | Parameters | VRAM (approx) | Command
qwen3:0.6b | 0.6B | ~1 GB | ollama run qwen3:0.6b
qwen3:1.7b | 1.7B | ~1.5 GB | ollama run qwen3:1.7b
qwen3:4b | 4B | ~3 GB | ollama run qwen3:4b
qwen3:8b | 8B | ~5-6 GB | ollama run qwen3:8b
qwen3:14b | 14B | ~10 GB | ollama run qwen3:14b
qwen3:32b | 32B | ~20 GB | ollama run qwen3:32b

Specialized Models

Model | Type | Command
qwen3-coder-next | Code generation | ollama run qwen3-coder-next
qwen2.5-coder | Code generation (stable) | ollama run qwen2.5-coder
qwen3-vl:8b | Vision + Language | ollama run qwen3-vl:8b

For coding tasks, see our Qwen Coder guide — it covers which coder variant to pick and how to connect it to your IDE. The vision model (qwen3-vl) is the only way to do image understanding through Ollama right now, since Qwen 3.5 vision isn't supported yet.

Known Issues with Qwen on Ollama (Read This First)

This is the section no other site writes. Ollama's Qwen support works for basic text generation, but there are six active issues that range from annoying to deal-breaking depending on your use case. All of these are documented on GitHub with open issue threads.

1. Speed: 5-7x Slower Than llama.cpp

This is the biggest practical gap. On identical hardware with the same model and quantization, Ollama delivers roughly 15-20 tokens per second where llama.cpp hits 80-100+ tok/s. The difference isn't subtle — it's the difference between usable real-time chat and watching paint dry on longer outputs.
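To put those numbers in perspective, here's a back-of-envelope wall-clock comparison for a long response, using the midpoints of the throughput ranges above (a sketch only — your actual rates depend on hardware and quantization):

```python
# Wall-clock time for a 1000-token response at the throughput midpoints
# quoted above (15-20 tok/s for Ollama, 80-100+ tok/s for llama.cpp).
def seconds_for(tokens: int, tok_per_sec: float) -> float:
    return tokens / tok_per_sec

tokens = 1000
ollama_s = seconds_for(tokens, 17.5)    # Ollama midpoint
llamacpp_s = seconds_for(tokens, 90.0)  # llama.cpp midpoint

print(f"Ollama:    {ollama_s:.0f}s")   # ~57s
print(f"llama.cpp: {llamacpp_s:.0f}s") # ~11s
```

Nearly a minute versus about ten seconds for the same answer — that's the gap a chat user actually feels.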

The root cause: Ollama hasn't fully optimized for the hybrid GatedDeltaNet architecture that Qwen 3.5 uses. The 75% linear attention layers that make these models efficient at the architecture level aren't getting the inference speedups they should. This is tracked in Issue #14579.

Workaround: There isn't one inside Ollama. If speed matters, llama.cpp is roughly 5x faster on the same hardware right now. On Apple Silicon, MLX is roughly 2x faster than Ollama as well.

2. Tool Calling: Completely Broken for Qwen 3.5

If you're building an agent or any workflow that relies on function calling, stop here. Tool calling with Qwen 3.5 in Ollama is non-functional — and it's not one bug, it's three stacked on top of each other:

  • Unclosed XML in the emitted tool-call tags, so clients can't parse the call
  • A missing generation prompt in the chat template for tool responses
  • Unimplemented penalty sampling on the tool-calling path

All three are tracked in Issue #14493. Until this is resolved, Qwen 3 models handle tool calling better in Ollama, or you can switch to llama.cpp where tool calling works correctly.

3. Vision: No Qwen 3.5 Support

Qwen 3.5 includes vision-capable models, but Ollama can't run them. The mmproj (multimodal projector) format is incompatible, so pulling a Qwen 3.5 vision variant simply fails.

Workaround: Use qwen3-vl:8b instead — it's the older Qwen 3 vision model, and it works fine in Ollama. You'll miss the 3.5 quality improvements, but at least image understanding is functional. For Qwen 3.5 vision specifically, llama.cpp has working support.

4. Out-of-Memory Crashes

Several users report sudden OOM kills, especially with larger models or long contexts. This is tracked in Issue #14557 and mostly affects versions before 0.17.5.

Fix: Update to Ollama v0.17.5 or later, and set these environment variables before launching:

export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0

Flash attention reduces peak memory by processing attention in chunks. The quantized KV cache (q8_0) cuts context memory roughly in half compared to the default FP16 cache. Together, they should eliminate most OOM crashes on cards with 8GB+ VRAM.

5. Thinking Mode Forced ON (800+ Invisible Tokens)

This one is sneaky. Ollama enables thinking mode by default for all Qwen 3.5 models. Every response starts with 800+ invisible <think> tokens that you never see in the output but that add 30-60 seconds of latency to every single reply. Your model isn't slow — it's silently "thinking" before it starts generating visible text.

Workaround: You can control thinking mode in the chat session with /no_think to disable it or /think to re-enable it. In API calls, include a system prompt that explicitly says "Do not use thinking mode" — though results are inconsistent. A Modelfile with PARAMETER think false (if supported in your version) is more reliable.
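If you're consuming responses through the API and can't disable thinking reliably, you can also strip the reasoning block client-side. This sketch assumes the reasoning is wrapped in <think>...</think> tags, the convention Qwen reasoning models typically emit:

```python
import re

# Remove <think>...</think> reasoning blocks from a model response.
# The tag name is an assumption based on Qwen's usual output format;
# adjust the pattern if your responses differ.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    return THINK_RE.sub("", text).strip()

raw = "<think>First, recall the capital...</think>Paris is the capital of France."
print(strip_thinking(raw))  # -> Paris is the capital of France.
```

Note this only cleans up the output — the hidden tokens were still generated, so the latency cost remains.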

6. Repetition Penalties Silently Ignored

Setting repeat_penalty, frequency_penalty, or presence_penalty in Ollama has no effect on Qwen 3.5 models. The parameters are accepted without error but silently discarded. This means you can't tune output diversity through the standard sampling knobs.

There's no clean workaround for this one. If you need fine-grained control over repetition behavior, use llama.cpp directly where all sampling parameters work as expected.

Ollama's pull-and-run workflow. The simplicity is real — the issues above are the tradeoff.

Configuration and Tips

Custom Modelfile

A Modelfile lets you lock in your preferred settings so you don't have to pass them every time. Create a file called Modelfile with these contents:

FROM qwen3.5:9b

PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """You are a helpful assistant. Respond concisely and accurately."""

Then build and run your custom model:

ollama create my-qwen -f Modelfile
ollama run my-qwen

Adjust num_ctx based on your VRAM. Higher context windows eat more memory — 8192 is safe for 8GB cards, bump to 16384 or 32768 if you have 16-24GB. Going above 32K with Ollama often triggers the OOM issues mentioned above.

Essential Commands

Command | What It Does
ollama list | Show all downloaded models and their sizes
ollama pull qwen3.5:9b | Download a model without starting chat
ollama rm qwen3.5:9b | Delete a model to free disk space
ollama cp qwen3.5:9b my-backup | Clone a model under a new name
ollama show qwen3.5:9b | Display model metadata, template, and parameters

Prevent OOM Crashes

If you're hitting memory limits, set these before starting Ollama:

# Enable flash attention (lower peak memory)
export OLLAMA_FLASH_ATTENTION=1

# Quantize KV cache to 8-bit (halves context memory)
export OLLAMA_KV_CACHE_TYPE=q8_0

On Windows, set these as environment variables through System Settings or in PowerShell with $env:OLLAMA_FLASH_ATTENTION="1".

OpenAI-Compatible API

Ollama exposes an API on localhost:11434 that's compatible with the OpenAI client format. This means any tool that supports custom OpenAI endpoints — Open WebUI, SillyTavern, your own scripts — can connect directly:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Explain transformers in 3 sentences."}]
  }'

No API key needed for local access. The server handles concurrent requests, so you can serve multiple clients or browser tabs simultaneously.
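For your own scripts, the same request works with nothing but the Python standard library. A minimal sketch, assuming the server started by ollama serve is up on the default port:

```python
import json
import urllib.request

# Build an OpenAI-style chat request for Ollama's local endpoint.
# Standard library only — no openai package required.
def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def chat(model: str, prompt: str) -> str:
    # Requires the Ollama server to be running locally.
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running):
# print(chat("qwen3.5:9b", "Explain transformers in 3 sentences."))
```

Since the endpoint mimics OpenAI's format, swapping in the official openai client with a custom base_url works the same way.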

Controlling Thinking Mode

During an interactive chat session, type /no_think to disable the hidden reasoning step and get faster responses. Type /think to re-enable it when you need deeper reasoning on a hard problem. Remember: with thinking enabled, every response burns 800+ tokens you never see, adding 30-60 seconds of latency.

When to Use Ollama vs When to Avoid It

Use Ollama when:
  • You want the simplest possible setup — one command, done
  • You need an OpenAI-compatible API for local serving
  • You're serving multiple users or UIs simultaneously
  • Text-only inference is all you need (no vision, no tool calling)
  • You're evaluating different Qwen model sizes quickly
Avoid Ollama when:
  • Speed matters — llama.cpp is roughly 5x faster on the same hardware
  • You need tool calling or function calling (broken for Qwen 3.5)
  • You need vision/image understanding with Qwen 3.5
  • You're on Apple Silicon — MLX is roughly 2x faster
  • You need precise control over sampling parameters
  • You're deploying to production with strict latency requirements

The honest summary: Ollama trades performance and feature completeness for convenience. That's a great deal if you're experimenting, prototyping, or just want to chat with Qwen locally. It's a bad deal if your workflow depends on any of the broken features listed above. Check the Ollama GitHub issues periodically — the team is actively working on Qwen support, and several of these problems may be fixed in future releases.

From the Community

Here's what developers running Qwen on Ollama are actually reporting:

One recurring report is around 14 tok/s on an M4 Mini. That number is real — but the forced thinking mode is exactly the issue we documented above. The invisible tokens silently eat your latency budget without any visible benefit on simple queries.

Another common early complaint was that Ollama couldn't load the new MoE models at all. That was true at launch; Ollama has since added MoE support in newer versions. But the complaint captures something important: Ollama tends to lag behind llama.cpp on new architecture support. If a new Qwen model drops and Ollama can't load it, llama.cpp built from source is usually the first to work.

Frequently Asked Questions

Why is Ollama so much slower than llama.cpp with Qwen models?

Qwen 3.5 uses a hybrid GatedDeltaNet architecture where 75% of layers run linear attention instead of standard quadratic attention. llama.cpp has optimized kernels for this. Ollama, which uses llama.cpp under the hood but adds its own serving layer, hasn't fully optimized for the hybrid path yet. The result is a roughly 5x speed penalty. Track progress on Issue #14579.

Can I use Qwen 3.5 vision models in Ollama?

No. The multimodal projector format for Qwen 3.5 vision models isn't compatible with Ollama yet. Your options are: use qwen3-vl:8b (the older Qwen 3 vision model that does work), or run Qwen 3.5 vision through llama.cpp which has full support.

How do I fix OOM (out-of-memory) errors?

Three steps: first, update to Ollama v0.17.5+ which fixed several memory leaks. Second, enable flash attention with export OLLAMA_FLASH_ATTENTION=1. Third, quantize the KV cache with export OLLAMA_KV_CACHE_TYPE=q8_0. If you still hit OOM after all three, your model is genuinely too large for your VRAM — try a smaller variant or check what fits with our hardware compatibility tool.

Is tool calling fixed yet?

As of March 2026, no. The three underlying bugs (unclosed XML, missing generation prompt, unimplemented penalty sampling) are all tracked in Issue #14493. Check that thread for the latest status. Qwen 3 (not 3.5) models have better tool calling support in Ollama if you need it now.

Which Qwen model should I start with in Ollama?

qwen3.5:9b for most people. It fits in 8GB VRAM, runs at usable speeds even with Ollama's overhead, and delivers surprisingly strong reasoning for its size. If you have 24GB, the qwen3.5:35b-a3b MoE model is the next step up — 35B parameters but only 3B active per token, so it's faster than you'd expect.