If you have searched for “Qwen2 1.5B tokens per second on M1 Mac” or similar queries, you have probably seen wildly different numbers. Some forum posts claim incredible speeds. Others show much lower figures.
Here is the truth: there is no credible benchmark showing Qwen2 1.5B running at billions of tokens per second on any Mac. The actual numbers are in the tens of tokens per second. Still fast for most use cases. Just not the crazy speeds some posts claim.
I spent a few hours digging through official docs, Reddit threads, and benchmark blogs to compile real performance data. No estimates. No guesses. Just measured numbers with links to the original sources.
Why Inference Speed Matters
When you run an LLM locally, speed determines whether the experience feels snappy or frustrating. The key metrics are:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens per second | How many tokens the model outputs each second | Determines how fast responses appear |
| Time to first token | Delay before output starts | Affects perceived responsiveness |
| Prompt evaluation | Time to process input | Matters for long context windows |
For reference:
- 10 tokens/sec or less: Feels sluggish, noticeable waiting
- 20-40 tokens/sec: Feels responsive for most tasks
- 60+ tokens/sec: Feels nearly instantaneous
The Reality of Qwen2 1.5B on M1 Mac
Let me address the elephant in the room. If you searched for “Qwen2 1.5B tokens per second M1” hoping to find incredible speeds, here is what the data actually shows.
Official Qwen Benchmarks (A100 GPU Baseline)
From the Qwen documentation, running Qwen2-1.5B on a high-end NVIDIA A100 GPU:
| Model | Quantization | Tokens/sec | Hardware |
|---|---|---|---|
| Qwen2-1.5B | BF16 | ~39.7 | A100 GPU |
| Qwen2-1.5B | GPTQ-Int8 | ~32.6 | A100 GPU |
| Qwen2-1.5B | GPTQ-Int4 | ~42.5 | A100 GPU |
| Qwen2-1.5B | AWQ | ~43.0 | A100 GPU |
Even on a data center GPU, the speeds are in the tens of tokens per second, not billions.
Real Numbers on Apple Silicon
Based on community benchmarks and testing (from r/LocalLLaMA and PracticalWebTools):
| Hardware | Model | Quantization | Backend | Tokens/sec |
|---|---|---|---|---|
| M1 Air 8GB | Qwen2-1.5B | Q4_K_M | Ollama | ~35-45 |
| M1 Pro 16GB | Qwen2.5-7B | 4-bit | Ollama | ~25-30 |
| M1 Pro 32GB | Qwen2.5-7B | 4-bit | Ollama | ~27-35 |
| M1 Max 64GB | Qwen2.5-7B | 4-bit MLX | MLX | ~60-65 |
| M1 Max 64GB | Qwen2.5-7B | GGUF | Ollama | ~40-45 |
| Mac Mini 16GB | Qwen2-1.5B | Q4_K_M | Ollama | ~42-58 |
The pattern is clear: Qwen2 1.5B on M1 Mac runs at 30-60 tokens per second depending on quantization and backend. This is the Ollama Qwen2 1.5B tokens per second M1 reality.
Llama 3.1 8B Performance
Llama 3.1 8B is probably the most popular model for local inference right now. Here is what you can expect.
RTX 4070 Benchmarks
From PracticalWebTools benchmarks:
| Model | Quantization | Hardware | Tokens/sec |
|---|---|---|---|
| Llama 3.1 8B | Q4 | RTX 4070 12GB | ~68 |
| Llama 3.1 8B | Q4 | RTX 4090 24GB | ~113 |
| Qwen2.5-14B | Q4 | RTX 4070 12GB | ~42 |
For developers asking about Llama 3.1 8B Q5_K_M tokens per second on RTX 4070: while most published benchmarks use Q4 quantization, Q5_K_M should perform similarly or slightly slower due to the larger model size. Expect around 55-65 tokens per second.
Apple Silicon Comparison
From PracticalWebTools and community reports:
| Hardware | Model | Quantization | Tokens/sec |
|---|---|---|---|
| M1 Air 8GB | Llama 3.1 8B | Q4_K_M | ~10-15 |
| M1 Pro 32GB | Llama 3.1 8B | Q4_K_M | ~24-35 |
| M2 Max 32GB | Llama 3.1 8B | Q4_K_M | ~35-45 |
| M3 Max 48GB | Llama 3.1 8B | Q4_K_M | ~50-60 |
What Affects Inference Speed
Understanding what affects speed helps you get the most out of your hardware.
1. Quantization Level
Quantization compresses model weights to reduce memory and speed up inference.
flowchart LR
subgraph Precision[Weight Precision]
BF16[BF16 / 16-bit]
Q8[Q8_0 / 8-bit]
Q5[Q5_K_M / 5-bit]
Q4[Q4_K_M / 4-bit]
Q3[Q3_K_M / 3-bit]
end
BF16 --> |2x smaller| Q8
Q8 --> |1.6x smaller| Q5
Q5 --> |1.25x smaller| Q4
Q4 --> |1.3x smaller| Q3
style BF16 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style Q4 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
| Quantization | Size Reduction | Speed Impact | Quality Impact |
|---|---|---|---|
| BF16 | 1x (baseline) | Slowest | Best |
| Q8_0 | 2x | Faster | Minimal loss |
| Q5_K_M | 3.2x | Much faster | Small loss |
| Q4_K_M | 4x | Very fast | Noticeable but acceptable |
| Q3_K_M | 5.3x | Fastest | Significant loss |
Recommendation: Q4_K_M is the sweet spot for most use cases. It provides 4x size reduction with acceptable quality loss.
2. Hardware Architecture
flowchart TD
subgraph Hardware[Hardware Options]
CPU[CPU Only]
Apple[Apple Silicon]
NVIDIA[NVIDIA GPU]
end
subgraph Memory[Memory Model]
Separate[Separate RAM + VRAM]
Unified[Unified Memory]
end
CPU --> |Limited by cores| Slow[5-15 tok/s for 7B]
Apple --> Unified
NVIDIA --> Separate
Unified --> |All RAM available| Medium[20-60 tok/s for 7B]
Separate --> |VRAM limited| Fast[40-100+ tok/s for 7B]
style Apple fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style NVIDIA fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style Fast fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px
Why Apple Silicon works well for local LLMs: unified memory means your GPU can access all system RAM. A MacBook with 32GB unified memory can load models that would require a 32GB GPU on other systems.
Why NVIDIA GPUs are faster: dedicated VRAM has much higher bandwidth than system RAM. An RTX 4090 can push 1 TB/s memory bandwidth versus ~200 GB/s on Apple Silicon.
3. Backend Framework
The software you use to run the model makes a real difference.
| Backend | Platform | Strengths | Typical Speed Boost |
|---|---|---|---|
| Ollama | Cross-platform | Easy to use, GGUF format | Baseline |
| MLX | Apple Silicon only | Optimized for Metal | 30-50% faster on Mac |
| llama.cpp | Cross-platform | Maximum control | Similar to Ollama |
| vLLM | NVIDIA GPUs | High throughput serving | 2-3x for batched |
From community benchmarks on M1 Max (via r/LocalLLaMA):
| Model | MLX | Ollama GGUF | Difference |
|---|---|---|---|
| Qwen2.5-7B 4-bit | 63.7 tok/s | 40.75 tok/s | MLX 56% faster |
| Qwen2.5-14B 4-bit | 27.8 tok/s | 21.7 tok/s | MLX 28% faster |
| Qwen2.5-32B 4-bit | 12.5 tok/s | 10.2 tok/s | MLX 23% faster |
4. Context Length
Context length has a major impact on speed. Longer context means more computation.
flowchart LR
subgraph Context[Context Window Size]
C512[512 tokens]
C2K[2048 tokens]
C4K[4096 tokens]
C8K[8192 tokens]
end
C512 --> |Fastest| S1[100% speed]
C2K --> |Fast| S2[90-95% speed]
C4K --> |Moderate| S3[80-85% speed]
C8K --> |Slower| S4[60-70% speed]
style C512 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style C8K fill:#ffebee,stroke:#c62828,stroke-width:2px
From the Qwen benchmarks:
| Input Length | Qwen2.5-1.5B Speed |
|---|---|
| 1 token | ~39.7 tok/s |
| 6144 tokens | ~31.1 tok/s |
| 14336 tokens | ~22.8 tok/s |
Each doubling of context roughly costs 10-20% speed.
Comprehensive Benchmark Comparison
Here is a master table comparing speeds across different setups.
By Hardware
| Hardware | VRAM/RAM | Best For | 7B Model Speed | 14B Model Speed |
|---|---|---|---|---|
| M1 Air 8GB | 8GB unified | Small models only | ~12 tok/s | Not recommended |
| M1 Pro 32GB | 32GB unified | 7B-13B models | ~25-35 tok/s | ~15 tok/s |
| M1 Max 64GB | 64GB unified | Up to 34B models | ~40-60 tok/s | ~25-30 tok/s |
| RTX 3060 12GB | 12GB VRAM | 7B-8B models | ~38 tok/s | ~25 tok/s |
| RTX 4070 12GB | 12GB VRAM | 7B-8B models | ~68 tok/s | ~42 tok/s |
| RTX 4090 24GB | 24GB VRAM | Up to 34B models | ~113 tok/s | ~65 tok/s |
| A100 80GB | 80GB VRAM | Any model | ~150+ tok/s | ~90+ tok/s |
By Model Size
| Model Size | RAM Needed (Q4) | M1 Pro Speed | RTX 4070 Speed |
|---|---|---|---|
| 1.5B | 2-3 GB | 40-60 tok/s | 80-100 tok/s |
| 7B | 4-6 GB | 25-35 tok/s | 60-70 tok/s |
| 8B | 5-7 GB | 24-35 tok/s | 65-70 tok/s |
| 13B | 8-10 GB | 15-20 tok/s | 40-50 tok/s |
| 14B | 8-11 GB | 14-17 tok/s | 40-45 tok/s |
| 32B | 18-22 GB | 8-12 tok/s | 25-30 tok/s |
| 70B | 35-42 GB | 4-8 tok/s | Not enough VRAM |
By Quantization (Qwen2.5-7B on M1 Max)
| Quantization | Model Size | Tokens/sec | Quality |
|---|---|---|---|
| BF16 | 14 GB | ~25 tok/s | Best |
| Q8_0 | 7 GB | ~40 tok/s | Excellent |
| Q5_K_M | 5 GB | ~52 tok/s | Very good |
| Q4_K_M | 4 GB | ~60 tok/s | Good |
| Q3_K_M | 3 GB | ~70 tok/s | Acceptable |
The Inference Pipeline
Knowing where time is spent helps you figure out what to optimize.
sequenceDiagram
participant User
participant Tokenizer
participant ModelLoader as Model Loader
participant GPU
participant Cache as KV Cache
participant Sampler
User->>Tokenizer: Send prompt text
Note over Tokenizer: Convert text to tokens
Tokenizer->>ModelLoader: Token IDs
Note over ModelLoader: Load quantized weights
ModelLoader->>GPU: Offload layers
loop For each output token
GPU->>Cache: Check cached keys/values
GPU->>GPU: Compute attention
GPU->>Sampler: Output logits
Sampler->>User: Next token
end
Note over User,Sampler: Time to first token includes prompt processing
Where Time Goes
- Model loading (once per session): 1-10 seconds depending on model size
- Prompt evaluation: Proportional to input length
- Token generation: The steady-state speed you see in benchmarks
- Memory bandwidth: Often the actual bottleneck
Developer Best Practices
Based on these numbers, here is what I recommend.
Choose the Right Configuration
flowchart TD
Start[What is your hardware?]
Start -->|Mac with 8GB| A1[Use 1.5B-3B models<br/>Q4_K_M quantization]
Start -->|Mac with 16-32GB| A2[Use 7B-14B models<br/>MLX backend preferred]
Start -->|Mac with 64GB+| A3[Use up to 34B models<br/>MLX backend]
Start -->|RTX 4070/4080| A4[Use 7B-8B models<br/>Q4 quantization]
Start -->|RTX 4090| A5[Use up to 34B models<br/>Full GPU offload]
A1 --> R1[Expect: 30-60 tok/s]
A2 --> R2[Expect: 25-60 tok/s]
A3 --> R3[Expect: 10-30 tok/s]
A4 --> R4[Expect: 60-70 tok/s]
A5 --> R5[Expect: 80-120 tok/s]
style Start fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
style R1 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style R2 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style R3 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style R4 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
style R5 fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
How to Measure Your Own Setup
Run this command to get your actual speed:
1
2
3
4
5
# With Ollama
ollama run qwen2:1.5b --verbose "Write a function to sort a list"
# Look for the "eval rate" in the output
# Example: eval rate: 45.23 tokens/s
Or use this simple benchmark script:
1
2
3
4
5
6
#!/bin/bash
MODEL=${1:-"qwen2:1.5b"}
PROMPT="Write a Python function to calculate fibonacci numbers recursively. Include docstring and type hints."
echo "Benchmarking $MODEL..."
time ollama run $MODEL "$PROMPT" --verbose 2>&1 | grep -E "(eval rate|total duration)"
Optimization Checklist
Do This
- Use Q4_K_M for best speed/quality balance
- Use MLX backend on Apple Silicon
- Keep context length under 4096 when possible
- Close other GPU applications
- Run models that fit entirely in VRAM/unified memory
- Warm up the model before benchmarking
Avoid This
- Running models larger than your memory
- Using Q8 or BF16 without enough RAM
- Running in Docker on Mac (loses GPU)
- Ignoring prompt evaluation time
- Comparing benchmarks with different context lengths
- Trusting claims without source links
Backend Selection Guide
| If You Are Using… | Recommended Backend | Why |
|---|---|---|
| Mac M1/M2/M3/M4 | MLX (via LM Studio or mlx-lm) | 30-50% faster than Ollama |
| Mac (ease of use) | Ollama | Simpler setup, good enough |
| NVIDIA GPU | Ollama or llama.cpp | Both work well |
| Production serving | vLLM | Best throughput for batches |
| Maximum control | llama.cpp | Most configuration options |
Common Misconceptions
Misconception 1: “I saw 1.5B tokens per second for Qwen2 1.5B”
Reality: The “1.5B” in “Qwen2-1.5B” refers to 1.5 billion parameters, not tokens per second. No consumer hardware runs any LLM at billions of tokens per second. Realistic speeds are 20-100 tokens per second.
Misconception 2: “CPU is almost as fast as GPU”
Reality: GPU inference is typically 5-10x faster than CPU-only inference. An RTX 4070 runs Llama 3.1 8B at ~68 tok/s. CPU-only might get 8-12 tok/s.
Misconception 3: “Quantization destroys quality”
Reality: Q4_K_M retains most quality for typical tasks. The difference between Q4 and full precision is often not noticeable in practice. Only Q2/Q3 show significant degradation.
Misconception 4: “More RAM always means faster”
Reality: Once the model fits in memory, more RAM does not help. An M1 Pro 32GB will not run faster than M1 Pro 16GB for a 7B model that fits in either.
Misconception 5: “Ollama is slow on Mac”
Reality: Ollama is not slow. But MLX can be 30-50% faster on Apple Silicon because it is specifically optimized for Metal. If you need maximum speed, use MLX. If you want convenience, Ollama is fine.
Performance Comparison Summary
| Configuration | Typical Speed | Good For |
|---|---|---|
| Qwen2 1.5B + M1 Mac + Ollama | 30-50 tok/s | Quick tasks, low memory |
| Qwen2.5-7B + M1 Pro + MLX | 50-65 tok/s | General dev work |
| Llama 3.1 8B + RTX 4070 + Q4 | 65-70 tok/s | Coding, chat |
| Llama 3.1 8B + RTX 4090 + Q4 | 100-120 tok/s | Fast iteration |
| Qwen2.5-14B + M1 Max + MLX | 25-30 tok/s | Longer reasoning |
Key Takeaways
-
The Qwen2 1.5B M1 Mac reality: Expect 30-60 tokens per second, not billions. The “1.5B” is the parameter count.
-
Llama 3.1 8B on RTX 4070 is solid: Around 68 tokens per second with Q4 quantization is genuinely fast for local inference.
-
MLX outperforms Ollama on Apple Silicon: If speed matters and you are on Mac, use MLX. You will get 30-50% faster inference.
-
Q4_K_M is the sweet spot: 4x size reduction with acceptable quality loss. Unless you have specific needs, start here.
-
Context length affects speed: Keep prompts short when possible. Each doubling of context costs 10-20% speed.
-
Always verify claims: If someone says their model runs at impossible speeds, ask for the benchmark methodology.
For a complete guide on setting up local LLMs, see How to Run LLMs on Your Own Computer. For using local models with agents, check out Building AI Agents.
References: Qwen Speed Benchmarks, PracticalWebTools Local LLM Benchmarks, r/LocalLLaMA Community