AI Glossary
A practical glossary of AI and local LLM terms you'll run into when working with Qwen models. Each definition includes Qwen-specific context where it matters — not just what a term means, but how it applies to the models you're actually running. If you're setting up Qwen on your own hardware for the first time, pair this page with our Can I Run Qwen? tool and the local deployment guide.
A — E
Active Parameters
The number of parameters that actually compute for each token in an MoE model. The rest exist in memory but sit idle. This distinction matters because active parameters determine inference speed, while total parameters determine VRAM requirements and the model's quality ceiling.
In Qwen: Qwen3.5 35B-A3B has 35 billion total parameters but only 3 billion activate per token — so it runs at speeds comparable to a 3B dense model while delivering quality closer to a much larger one. You still need VRAM for all 35B, though.
Apache 2.0 License
One of the most permissive open-source licenses available, used by Qwen3 and Qwen3.5 models. You can use these models for any commercial purpose — products, SaaS, internal tools — modify them freely, and distribute the results. The only obligations: include the original copyright notice, state significant changes, and include the license file.
A key advantage over some other open licenses: Apache 2.0 includes explicit patent grants. Contributors give you a license to any patents necessary to use the software, which protects you from patent claims down the line.
AWQ (Activation-Aware Weight Quantization)
A GPU-only quantization method that preserves the weights most important for model accuracy. AWQ analyzes which weights matter most by observing activation patterns, then keeps those at higher precision. The result: slightly better quality than GPTQ at the same bit level, especially at 4-bit.
AWQ files use the .safetensors format and work with GPU inference engines like vLLM and Transformers. If you're serving Qwen from a dedicated GPU and want the best quality-per-bit ratio, AWQ is worth considering over GPTQ. For CPU/GPU hybrid setups, stick with GGUF instead.
Base Model vs Instruct Model
A base model is trained on raw text prediction; an instruct model is further trained to follow instructions and hold conversations. Base models are great at completing text but won't reliably answer questions or follow directions. Instruct models go through additional training (RLHF, DPO, or supervised fine-tuning) to become the helpful chatbots you actually interact with.
In Qwen: On HuggingFace, Qwen3-32B is the base model and Qwen3-32B-Instruct is the chat-ready version. For local use, you almost always want the Instruct variant. Base models are mainly useful if you're doing your own fine-tuning.
BF16 / FP16 / FP8
Number formats that define how precisely model weights are stored. FP16 (16-bit floating point) is the standard full-precision format — a 7B model at FP16 takes about 14GB. BF16 (Brain Float 16) uses the same 16 bits but allocates them differently: wider dynamic range at the cost of some decimal precision, which works better for training. FP8 (8-bit) is a hardware-accelerated format on newer GPUs like the H100 and RTX 4090, cutting memory roughly in half with minimal quality loss.
For most local users, these matter mainly as the "ceiling" — BF16/FP16 is as good as the model gets. Everything below is a trade-off between quality and VRAM. See quantization.
Context Window
The maximum number of tokens a model can process in a single conversation, including both your input and its output. A 128K context window means roughly 96,000 words — enough for an entire novel or a large codebase. But longer contexts consume more VRAM (the KV cache grows linearly) and can slow down generation.
In Qwen: Qwen3 models support 32K-128K depending on size. Qwen3.5 pushes this much further — Flash and Plus variants handle up to 1 million tokens. Keep in mind that "supports 128K" doesn't mean you should always use 128K. Shorter contexts are faster and cheaper.
Dense Model
A model where every parameter activates for every token. All neurons fire on every forward pass — no routing, no skipping. Qwen3 8B, Qwen3 32B, and Llama 3 70B are all dense models. The upside: simpler architecture, and memory requirements directly match the compute you're using. The downside: you can't scale to hundreds of billions of parameters without enormous VRAM. That's where MoE comes in.
Embedding Model
A model that converts text into numerical vectors for search, similarity comparison, and RAG retrieval. Unlike generation models (Qwen3 32B, GPT-4), embedding models don't produce text output. They take a sentence or paragraph and return an array of numbers that represents its meaning. Two texts about similar topics will have vectors pointing in similar directions.
In Qwen: The Qwen3-Embedding series includes 4B and 8B models. They rank competitively on MTEB benchmarks and work well as the retrieval backbone in RAG pipelines.
EXL2
A GPU-only quantization format used by ExLlamaV2 that assigns different bit levels to different layers. Instead of quantizing the entire model to, say, 4 bits uniformly, EXL2 can use 3 bits for less important layers and 5 bits for critical ones. This variable approach often produces the best quality-to-size ratio of any quantization method.
The catch: EXL2 only works with ExLlamaV2 and requires a full GPU — no CPU offloading. If you've got a powerful GPU with enough VRAM and want the absolute best quality per gigabyte, EXL2 is hard to beat. For flexibility, GGUF remains the safer choice.
F — K
Fine-Tuning
The process of training an existing model on your own data to specialize it for a specific task or domain. Instead of training from scratch (which costs millions), you take a pre-trained model and adjust its weights using a much smaller dataset. The result: a model that retains general knowledge but excels at your specific use case — medical terminology, legal documents, your company's coding style, whatever you need.
Full fine-tuning updates every parameter and requires serious hardware. Most people use LoRA or QLoRA instead, which makes the process feasible on a single consumer GPU.
Flash Attention
An optimized implementation of the attention mechanism that cuts memory usage and speeds up computation by 2-8x. Standard attention creates a massive matrix that scales quadratically with sequence length. Flash Attention avoids this by processing attention in small tiles, minimizing expensive data transfers between GPU memory levels.
You rarely need to think about Flash Attention directly — vLLM, llama.cpp, and most modern inference engines enable it automatically when running Qwen models. It's especially important for long contexts, where the speedup is most dramatic.
GatedDeltaNet
A linear attention variant that combines delta rule memory updates with Mamba-style gating, scaling linearly with sequence length instead of quadratically. Standard attention gets expensive fast as context grows — processing 1 million tokens with full attention would be prohibitive. GatedDeltaNet sidesteps this by compressing past context into a fixed-size memory state that updates incrementally.
In Qwen: Qwen3.5 uses a hybrid architecture where roughly 75% of layers run GatedDeltaNet and 25% use standard full attention, repeating in a pattern. The full attention layers handle retrieval and global context; the linear layers keep inference efficient. This hybrid design is what makes Qwen3.5's million-token context window practical rather than theoretical.
GGUF
The standard file format for running quantized LLMs locally. A single .gguf file contains everything needed: weights, architecture metadata, tokenizer, and quantization parameters. Created by the llama.cpp project, GGUF replaced the older GGML format in 2023 and is now the default for local inference.
In Qwen: If you're running Qwen via Ollama, LM Studio, or llama.cpp, you're using GGUF files. All Qwen models are available as GGUF on HuggingFace, uploaded by the Qwen team or community quantizers like Bartowski and Unsloth. GGUF's killer feature: it supports CPU/GPU hybrid inference, so you can split a model across your GPU and system RAM when VRAM alone isn't enough.
GPQA Diamond
A benchmark of PhD-level science questions in biology, chemistry, and physics that even domain experts only answer correctly 65-74% of the time. GPQA tests genuine reasoning ability, not just pattern matching or knowledge retrieval. When a model scores 80%+ on GPQA Diamond, it's demonstrating scientific understanding that goes beyond what most humans can match without specialized training.
GPTQ
A post-training quantization method designed for GPU-only inference. GPTQ compresses model weights (typically to 4-bit) using a one-shot calibration process. It produces .safetensors files that work with vLLM, Transformers, and other GPU inference frameworks.
GPTQ was one of the first practical quantization methods for large models and remains widely used. AWQ generally offers slightly better quality at the same bit level, though. For local CPU/GPU hybrid setups, GGUF is the better choice since GPTQ is GPU-only.
HuggingFace
The central hub for downloading AI models, datasets, and tools — think of it as GitHub for machine learning. Most Qwen models are published on HuggingFace first. The platform hosts model cards with benchmarks, provides the transformers library for running models in Python, and supports direct downloads of GGUF, GPTQ, and AWQ quantizations. Free to use, and you don't need an account to download most models.
Imatrix (Importance Matrix)
A calibration technique that identifies the most important model weights before quantization, preserving them at higher precision. An importance matrix is computed by running sample data through the model and measuring which weights have the biggest impact on output quality. During quantization, those weights get priority treatment.
The difference is most noticeable at aggressive quantization levels. A Q3 quantization with imatrix can approach Q4 quality, while a Q3 without it degrades noticeably. When downloading Qwen GGUFs from community quantizers, look for "imatrix" in the description — most high-quality uploads from Bartowski use it by default.
Inference vs Training
Training teaches a model; inference uses it. Training is the process of adjusting billions of parameters over weeks or months using thousands of GPUs — that's Alibaba's job. Inference is what happens when you type a prompt and get a response, whether locally or through an API. When people talk about "running" a model or optimizing performance, they mean inference.
KV Cache
A memory buffer that stores pre-computed attention states so the model doesn't reprocess the entire conversation for every new token. Without a KV cache, generating token #500 would require recomputing attention across all 499 previous tokens from scratch. The cache makes generation fast — but it grows linearly with conversation length, quietly eating VRAM.
This is the hidden VRAM consumer that catches people off guard. Your model loads fine, you start chatting, and 20 messages later you hit an out-of-memory error. That's the KV cache filling up. For a 32B-class model at FP16, expect on the order of 0.25-1GB per 1K tokens of context, depending on how many KV heads the architecture uses (grouped-query attention shrinks the cache considerably). Solutions: shorten context, use a smaller model, or enable Q8 KV cache (halves the memory cost).
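A back-of-the-envelope estimate makes the linear growth concrete. The configuration below (64 layers, 8 KV heads via grouped-query attention, head dimension 128) is hypothetical, chosen to resemble a 32B-class model; real architectures vary:

```python
def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys + values, for every layer, for every token in context."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return n_tokens * per_token / 1024**3

# FP16 cache for a full 32K context with the hypothetical config above
print(kv_cache_gb(32_000, 64, 8, 128))                    # prints 7.8125 (GB)

# Q8 KV cache stores one byte per element, halving the cost
print(kv_cache_gb(32_000, 64, 8, 128, bytes_per_elem=1))  # prints 3.90625 (GB)
```

Double the tokens and the cache doubles: that is the linear growth eating your VRAM mid-conversation.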
L — P
LiveCodeBench
A rolling coding benchmark that adds fresh competitive programming problems monthly, making it resistant to data contamination. Unlike static benchmarks where models might have seen the answers during training, LiveCodeBench uses problems published after each model's training cutoff. This makes it one of the most trustworthy measures of genuine coding ability.
llama.cpp
The foundational C/C++ library that powers most local LLM inference. Created by Georgi Gerganov, llama.cpp defined the GGUF format and supports CPU, GPU, and hybrid inference. Most tools you've heard of — Ollama, LM Studio — are built on top of it. If you want maximum control over how you run Qwen locally, llama.cpp is the engine underneath everything.
LM Studio
A desktop application for running LLMs locally with a graphical interface. Download models from HuggingFace, configure settings, and chat — no command line required. LM Studio uses llama.cpp under the hood and supports GGUF models. It's the easiest entry point for people new to local AI. See our guide to running Qwen in LM Studio.
LoRA / QLoRA
Parameter-efficient fine-tuning methods that let you customize a model without retraining every weight. LoRA freezes the original model and trains small adapter matrices, typically adding only 0.1-1% as many parameters as the base model. QLoRA goes further: it loads the base model in 4-bit precision while training the adapters at higher precision, making it possible to fine-tune a 32B model on a single 24GB GPU.
In Qwen: LoRA and QLoRA are the go-to methods for customizing Qwen models. Popular tools include Unsloth (roughly 2x faster than standard LoRA), Axolotl, and LLaMA-Factory. The Qwen team provides official fine-tuning documentation on their GitHub.
Min-P
A sampling parameter that filters out low-probability tokens relative to the highest-probability token. If Min-P is set to 0.05, any token with probability less than 5% of the top token's probability gets discarded. This approach adapts dynamically — when the model is confident, few tokens pass the filter; when it's uncertain, more options remain. Many users find Min-P produces better results than Top-P alone.
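A minimal sketch of the thresholding rule in plain Python (real engines operate on full logit tensors; this just shows the filter's behavior):

```python
import math

def min_p_candidates(logits, min_p=0.05):
    """Keep tokens whose probability is at least min_p times the top token's."""
    probs = [math.exp(l) for l in logits]   # unnormalized is fine: only ratios matter
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

print(min_p_candidates([5.0, 1.0, 0.9, -2.0]))   # confident model: prints [0]
print(min_p_candidates([1.0, 0.9, 0.8, -3.0]))   # uncertain model: prints [0, 1, 2]
```

When one logit dominates, only that token survives; with a flatter distribution, several candidates pass, which is exactly the adaptive behavior described above.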
MLX
Apple's machine learning framework optimized for Apple Silicon chips (M1 through M4). MLX takes advantage of the unified memory architecture where CPU and GPU share the same RAM, eliminating the VRAM bottleneck that limits most PCs. A MacBook Pro with 36GB unified memory can run models that would require a dedicated 36GB GPU on a PC.
The mlx-community organization on HuggingFace provides ready-to-use Qwen models in MLX format. If you're on a Mac, MLX is your best option for local inference — faster than llama.cpp on Apple hardware in most scenarios.
MMLU / MMLU-Pro
Broad knowledge benchmarks that test a model across dozens of academic subjects. MMLU covers 57 topics from STEM to humanities. MMLU-Pro is the harder version with more complex questions and 10 answer choices instead of 4. Most frontier models now score 85%+ on standard MMLU, which makes MMLU-Pro the more useful benchmark for differentiating top models.
MoE (Mixture of Experts)
An architecture where the model has many "expert" sub-networks but only activates a few per token, delivering big-model quality at small-model speed. A learned router decides which experts handle each token. The rest stay idle. This means MoE models are faster than dense models of the same total size — but they still need all parameters loaded into memory.
In Qwen: Several Qwen models use MoE. Qwen3 235B-A22B has 235B total parameters with 128 experts, but only 8 activate per token (22B active). Qwen3.5 35B-A3B activates just 3B per token. The critical gotcha: you need VRAM for ALL parameters, not just the active ones. A 235B MoE model needs roughly as much memory as a 235B dense model — it's just faster per token.
MTEB
Massive Text Embedding Benchmark — the standard leaderboard for comparing embedding models. MTEB evaluates across 8 tasks: classification, clustering, pair classification, reranking, retrieval, semantic textual similarity, summarization, and bitext mining. If you're choosing an embedding model for a RAG pipeline, MTEB scores are the primary reference point.
Multimodal Model
A model that processes multiple input types — text, images, video, or audio — not just text. Vision-Language Models (VLMs) are the most common variant, handling both images and text. You can feed them a screenshot, a chart, or a photo and ask questions about what they see.
In Qwen: Qwen3-VL handles images and video. Qwen3-Omni processes text, images, video, and audio in a single model. Specialized variants exist too: Qwen3-ASR for speech recognition and Qwen3-TTS for text-to-speech.
Ollama
The simplest way to run LLMs locally — one command to download and start chatting. Install Ollama, run ollama run qwen3:32b, and you're done. It handles model downloading, quantization, and serving automatically using llama.cpp under the hood. Ollama supports GGUF models and works on macOS, Linux, and Windows.
For a step-by-step walkthrough, see our guide to running Qwen with Ollama.
Parameters
The learnable weights in a neural network, measured in billions (B). "32B" means 32 billion parameters. More parameters generally means a more capable model — but also more memory and compute. The memory math is straightforward: at FP16, each parameter takes 2 bytes, so a 7B model needs about 14GB. At Q4_K_M quantization, that drops to roughly 0.6 bytes per parameter — about 4.2GB for the same 7B model.
Qwen model sizes: 0.6B, 1.7B, 4B, 8B, 14B, 32B, 72B (dense), plus MoE variants at 30B-A3B, 35B-A3B, 235B-A22B, and 397B-A17B.
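The memory math above is simple enough to script. A minimal estimator (decimal GB, weights only; add 1-2GB for KV cache and overhead):

```python
def weights_gb(params_billion, bytes_per_param):
    """Approximate weight storage: billions of parameters x bytes per parameter = GB."""
    return params_billion * bytes_per_param

print(round(weights_gb(7, 2.0), 1))    # FP16: prints 14.0
print(round(weights_gb(7, 0.6), 1))    # Q4_K_M: prints 4.2

# The counterintuitive rule in practice: 32B at Q4 needs less VRAM than 14B at FP16
print(round(weights_gb(32, 0.6), 1))   # prints 19.2
print(round(weights_gb(14, 2.0), 1))   # prints 28.0
```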
Presence Penalty / Repetition Penalty
Parameters that discourage the model from repeating itself. Presence penalty applies a flat penalty to any token that has already appeared; repetition penalty multiplicatively scales down the logits of previously seen tokens. Cranking these too high makes output erratic and unnatural. Too low and the model can get stuck in loops, especially smaller models.
A typical starting point: repetition penalty of 1.05-1.15. Leave at 1.0 (off) if you're not seeing repetition issues.
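A sketch of one common convention (several open-source engines implement it roughly this way, though details vary): repetition penalty divides positive logits and multiplies negative ones, so seen tokens always become less likely, while presence penalty subtracts a flat amount:

```python
def penalize(logits, seen_tokens, presence_penalty=0.0, repetition_penalty=1.0):
    """Lower the logits of tokens that already appeared in the output."""
    out = list(logits)
    for t in set(seen_tokens):
        if out[t] > 0:
            out[t] /= repetition_penalty   # shrink positive logits toward zero
        else:
            out[t] *= repetition_penalty   # push negative logits further down
        out[t] -= presence_penalty
    return out

# Token 0 was already generated; a mild 1.1 penalty nudges its logit down
print(penalize([3.0, 1.0, -0.5], [0], repetition_penalty=1.1))
```

Unseen tokens are untouched, which is why a setting of 1.0 (divide by one, subtract zero) is equivalent to turning the penalties off.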
Prompt Processing vs Token Generation
Two distinct phases of inference with very different speed profiles. Prompt processing (prefill) reads your entire input at once — it's fast and parallelizable, often exceeding 1000 tok/s. Token generation (decode) produces output one token at a time, each depending on all previous tokens. This is the bottleneck, and it's what people mean when they report tok/s speeds.
Practical takeaway: pasting a long document into a prompt doesn't slow down the response nearly as much as you'd expect. Processing input is cheap; generating output is expensive.
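The two-phase split is easy to model. The 1000 and 25 tok/s figures below are illustrative placeholders, not measurements of any particular setup:

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps=1000, decode_tps=25):
    """Prefill is parallel and cheap; decode is serial and dominates total time."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# An 8,000-token document pasted in, answered with 500 tokens:
print(response_seconds(8000, 500))   # prints 28.0  (8 s prefill + 20 s decode)
```

At these rates, doubling the prompt adds only 8 seconds, while doubling the answer adds 20: output length, not input length, drives the wait.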
Q — T
Quantization
The process of reducing the precision of model weights to shrink memory usage and speed up inference, at a small cost to accuracy. Think of it as JPEG compression for AI models — you trade a bit of quality for dramatically smaller file sizes. A 7B model at full precision takes about 14GB. At Q4_K_M (the sweet spot for most users), that drops to around 4GB.
A counterintuitive rule worth remembering: a larger model at lower quantization usually outperforms a smaller model at full precision. Qwen3 32B at Q4_K_M typically beats Qwen3 14B at FP16 — you're better off checking what fits your GPU and going as big as you can at Q4 or above.
Common quantization levels: Q2_K (extreme compression, quality suffers), Q3_K_M (usable but noticeable loss), Q4_K_M (the sweet spot — barely noticeable degradation), Q5_K_M (very close to original), Q6_K (near-lossless), Q8_0 (virtually identical to full precision). See the quality ladder table below.
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in real documents by retrieving relevant information before generating an answer. The pipeline: your question gets converted to a vector by an embedding model, similar documents are found in a vector database, and those documents are fed to the LLM as context. The model then generates a response based on actual data rather than relying solely on training knowledge.
RAG addresses two fundamental LLM limitations: outdated training data (your knowledge base updates in real time) and hallucination (responses cite real documents). For building RAG with Qwen, the Qwen3-Embedding models handle the retrieval side while any Qwen generation model handles the response.
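The retrieval step reduces to nearest-neighbor search over embedding vectors. A toy sketch with hand-written 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions and come from a model such as Qwen3-Embedding):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, doc_vecs, top_k=2):
    """Return indices of the top_k documents most similar to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:top_k]

docs = [[0.9, 0.1, 0.0],   # hypothetical doc about topic A
        [0.0, 1.0, 0.1],   # topic B
        [0.8, 0.2, 0.1]]   # also topic A
print(retrieve([1.0, 0.0, 0.0], docs))   # prints [0, 2]: both topic-A docs
```

In a real pipeline, a vector database does this search at scale, and the retrieved documents are prepended to the prompt before the generation model runs.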
Reranker
A model that re-scores a list of candidate documents by processing query and document together, producing more accurate relevance rankings than embedding similarity alone. In a RAG pipeline, you'd typically retrieve the top 100 candidates fast using embeddings, then rerank to get the best 10. Slower but significantly more precise.
Sparse Model
A model where only a fraction of parameters activate per token — the opposite of dense. All MoE models are sparse. The non-active parameters still exist in memory; they just don't compute for that particular token. "Sparse" refers to the activation pattern, not the memory footprint.
SWE-bench
A benchmark that tests real-world software engineering ability using actual GitHub issues from popular Python repositories. The model must read the codebase, understand the bug, and write a working fix. SWE-bench Verified is the curated subset that filters out ambiguous or poorly specified issues. This is one of the most practical benchmarks — if a model scores well here, it can genuinely help with real code.
Temperature
Controls how random or focused the model's output is. Low temperature (0.1-0.3) makes the model stick to the most probable tokens — good for coding and factual answers where you want consistency. High temperature (0.7-0.9) lets the model explore less likely options — better for creative writing and brainstorming. Temperature 0 is fully deterministic: same input always produces the same output.
Starting points for Qwen: 0.1-0.3 for code and factual work, 0.6 for general chat, 0.7-0.9 for creative writing.
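Mechanically, temperature just divides the logits before the softmax. A minimal sketch:

```python
import math

def softmax_t(logits, temperature=1.0):
    """Scale logits by 1/T, then softmax. T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
print([round(p, 2) for p in softmax_t(logits, 0.2)])  # sharp: [0.99, 0.01, 0.0]
print([round(p, 2) for p in softmax_t(logits, 2.0)])  # flat: [0.51, 0.31, 0.19]
```

Same logits, very different distributions: at low temperature the top token takes nearly all the probability mass, which is why low settings feel deterministic.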
Thinking Mode (Chain-of-Thought)
A mode where the model reasons step-by-step before giving its final answer, significantly improving accuracy on complex problems. The reasoning process appears as a visible "thinking" block. On hard math and logic problems, thinking mode can improve accuracy by 30-50% — a massive difference.
In Qwen: Qwen3 models support hybrid thinking — toggle it with enable_thinking=true/false. QwQ is Qwen's dedicated reasoning model that always thinks. The trade-off: thinking tokens count toward your usage, making responses slower and more expensive on API. For simple questions, turn it off. For math, code debugging, or multi-step reasoning, keep it on.
tok/s (Tokens per Second)
The standard measure of how fast a model generates output. One token is roughly 3/4 of an English word. Below 5 tok/s feels painfully slow. Around 15-30 tok/s is comfortable for interactive chat. Above 60 tok/s, the model outputs faster than you can read.
Qwen reference speeds: Qwen3 8B at Q4_K_M on an RTX 4090 typically hits 60-80 tok/s. Qwen3 32B Q4_K_M on the same GPU: around 20-30 tok/s. Qwen3 235B-A22B Q4_K_M with dual RTX 4090s: roughly 8-15 tok/s. See the speed reference table below.
Top-K
A sampling parameter that limits the model to considering only the K most probable next tokens. Top-K of 40 means only the 40 highest-probability tokens are candidates at each step; everything else is discarded. Top-K of 1 is greedy decoding: always pick the single most likely token. Higher values produce more diverse output.
Top-P (Nucleus Sampling)
A sampling parameter that considers the smallest set of tokens whose cumulative probability reaches P. Top-P of 0.9 means: include tokens until their combined probability hits 90%, then discard the rest. Unlike Top-K, this adapts to the probability distribution — when the model is confident, few tokens qualify; when it's uncertain, more pass through.
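Engines typically chain these filters. A toy sketch that applies Top-K first, then keeps the smallest prefix whose cumulative probability reaches Top-P (the probabilities are hand-picked for illustration):

```python
def top_k_top_p(probs, top_k=40, top_p=0.9):
    """Indices that survive Top-K followed by Top-P (nucleus) filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cum = [], 0.0
    for i in ranked:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:                 # nucleus reached: stop adding candidates
            break
    return kept

probs = [0.5, 0.3, 0.1, 0.05, 0.05]
print(top_k_top_p(probs, top_k=4, top_p=0.9))   # prints [0, 1, 2]
```

Note the adaptive behavior: the two 0.05 tokens are cut even though Top-K allowed four candidates, because the first three tokens already cover 90% of the probability mass.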
Transformer
The neural network architecture behind virtually all modern LLMs, including every Qwen model. Introduced in 2017's "Attention Is All You Need" paper, transformers process input through layers of self-attention (letting each token "attend" to every other token) and feed-forward networks. The self-attention mechanism is what gives LLMs their ability to understand context and relationships across long text sequences.
V
vLLM
A high-throughput inference engine built for serving LLMs at scale from GPUs. vLLM uses PagedAttention to manage KV cache memory efficiently, preventing waste from fragmentation. It's the go-to choice for running Qwen as an API endpoint on a dedicated GPU server — not for casual local use (that's Ollama's territory), but for production serving where you need to handle multiple concurrent requests.
VRAM vs RAM
VRAM is fast GPU memory (up to 1 TB/s bandwidth); RAM is slower system memory (50-80 GB/s). LLM inference is memory-bandwidth limited, which means layers loaded in VRAM process tokens 10-20x faster than layers offloaded to RAM. When a model doesn't fit entirely in VRAM, llama.cpp can split it across both — it works, just slower.
Rule of thumb: model size in GB at your quantization level + 1-2GB for KV cache and overhead. A Qwen3 8B at Q4_K_M takes about 5GB on disk, so you'll want at least 6-7GB of VRAM for a comfortable fit. Check Can I Run Qwen? for specific GPU recommendations.
Quick Reference Tables
Quantization Quality Ladder
Sizes shown for a 7B-parameter model. Scale proportionally for larger models.
| Quant Level | Quality vs Full | Size (7B) | When to Use |
|---|---|---|---|
| BF16 | 100% (baseline) | ~14 GB | Unlimited VRAM, maximum fidelity |
| Q8_0 | ~99% | ~7.5 GB | Quality-first users with VRAM to spare |
| Q6_K | ~97% | ~5.5 GB | Production workloads, near-lossless |
| Q5_K_M | ~95% | ~4.8 GB | Coding and reasoning tasks |
| Q4_K_M | ~92% | ~4.1 GB | Sweet spot for most users |
| Q3_K_M | ~85% | ~3.2 GB | VRAM-limited setups |
| Q2_K | ~75-80% | ~2.5 GB | Last resort only |
Typical tok/s Ranges
| Speed | Rating | What It Feels Like |
|---|---|---|
| 1-5 tok/s | Slow | Barely usable for chat — you'll be watching words appear one by one |
| 5-15 tok/s | Acceptable | Usable with patience, fine for batch or background tasks |
| 15-30 tok/s | Good | Comfortable for interactive chat |
| 30-60 tok/s | Fast | Real-time feel, smooth experience |
| 60+ tok/s | Very fast | Faster than you can read — effectively instant |
VRAM Quick Guide (Q4_K_M)
| VRAM | Qwen Models That Fit |
|---|---|
| 4 GB | 0.6B, 1.7B — small models only |
| 8 GB | Up to 8B (tight fit, short context recommended) |
| 12 GB | 8B comfortably, 14B with partial offload |
| 16 GB | Up to 14B (tight), 8B with full context headroom |
| 24 GB | Up to 32B (tight), 14B comfortably, 35B-A3B MoE |
| 48 GB+ | 32B comfortably, 235B-A22B with partial offload |
Want to check exactly what runs on your specific GPU? Use our Can I Run Qwen? tool for personalized recommendations, or read the full guide to running Qwen locally.