> TL;DR

Problem: Gemma 4 26B scored 2/10 in my first benchmark — 2 tok/s on Ollama, broken tool calling, unusable on 24GB.

Root cause: Ollama ships only Q4_K_M (18GB — won't fit), has 3 separate Gemma 4 bugs, and offers no smaller quant.

Fix: Unsloth Q3_K_XL (12GB) + llama.cpp = 49 tok/s, perfect tool calling, 3GB headroom.

Bonus: Built localcoder — a Claude Code-style CLI agent (867 lines of Python) that autonomously creates full-stack apps, connects to devices via ADB, and searches the web. All local, $0.00.

> Part 1: All Models Tested > Dev.to Version

> The Setup

  • Hardware: MacBook Pro M4 Pro, 24GB unified memory
  • Model: Gemma 4 26B-A4B (MoE, 3.8B active params per token)
  • Quantization: Unsloth Dynamic Q3_K_XL (12GB, imatrix calibrated)
  • Runtime: llama.cpp via Unsloth Studio
  • Agent: localcoder — 867 lines of Python

> The Journey: 5 Attempts Before It Worked

Attempt 1: Ollama + E4B — Tool Calling Broken

ollama run gemma4:e4b loads fine at 5.5GB. But tool calls go into the reasoning field with empty content. The model generates perfect tool calls — Ollama's parser loses them. (ollama#15241)

Attempt 2: Ollama + 26B — Plans But Never Acts

The 26B loads via Ollama at 20GB. Raw curl test shows perfect tool callingfinish_reason: "tool_calls". But through OpenCode? The model writes markdown plans and never calls a single tool. Same parser bug, different symptom.

Attempt 3: llama.cpp + Q4_K_M (15.6GB) — GPU OOM

Switched to llama.cpp with --jinja. But Q4_K_M weights (15.6GB) + compute buffers (518MB) exceed Metal's ~16GB limit. Every request: 500: Compute error.

Attempt 4: llama.cpp + Q4_K_M + --fit on — Loads, Then Crashes

-fit on auto-offloads 3 layers to CPU. Model loads. But inference still OOMs — compute buffers exceed available GPU memory during forward pass.

Attempt 5: llama.cpp + Unsloth Q3_K_XL (12GB) — It Works!

Unsloth's Dynamic Q3_K_XL uses imatrix calibration at 4.09 bits per weight. 12GB model file = 3.4GB headroom for KV cache and compute buffers.

~/.unsloth/llama.cpp/llama-server \ -m ~/models/gemma4-26b/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf \ --port 8089 -ngl 99 -c 16384 -np 1 \ -fa on -ctk q4_0 -ctv q4_0 \ --no-warmup --no-mmproj --cache-ram 0 --jinja

No OOM errors. 49 tok/s. Perfect tool calling. First try.

> Memory Breakdown: Why Q3_K_XL Fits

ComponentSize
Model weights (Q3_K_XL on GPU)12,316 MB
Model weights (overflow on CPU)748 MB
KV cache (q4_0 quantized, 16K ctx)174 MB
Compute buffers523 MB
Total GPU~13 GB of ~16 GB limit
Free headroom~3 GB

What Each Flag Does

FlagEffectMemory Impact
Q3_K_XL vs Q4_K_MSmaller quant, imatrix calibrated-3.6 GB
-ctk q4_0 -ctv q4_0Quantize KV cache to 4-bit-75% of KV
-fa onFlash attention (required for KV quant)Enables above
--no-mmprojSkip vision encoder-1.1 GB
--jinjaGemma 4 native tool calling templateEnables tool calls

> Why Ollama Can't Do This (3 Bugs + Missing Quant)

Bug 1: Tool Call Parser Broken

Tool call data goes into reasoning field with empty content. Model generates perfect tool calls — Ollama loses them. (ollama#15241)

Bug 2: Streaming Drops Tool Calls

Even with correct finish_reason: "tool_calls", streaming mode fails. AI SDK layer doesn't parse them. Model says "I cannot execute commands" instead. (opencode#20995)

Bug 3: API Hangs on M4 Macs

/v1/chat/completions hangs indefinitely on M4. Runner processes at 200-380% CPU, zero output. Fixed in 0.20.2, but tool calling bugs remain. (ollama#15258)

The Missing Quantization

Ollama only ships Q4_K_M (18GB) and Q8_0 (28GB) for Gemma 4 26B. No Q3 variants at all. On 24GB hardware, Q4_K_M exceeds Metal's GPU limit = swap = 2 tok/s. The "safe" choice is the broken one.

Same Model, Different Runtime

llama.cpp (Unsloth)Ollama
QuantQ3_K_XL (12GB)Q4_K_M only (18GB)
Speed49 tok/s~2 tok/s
Tool callingPerfectBroken (3 bugs)
GPU fit13/16 GB18+ GB (OOM)

> Benchmarks: Same Tests, 26B Now Works

Same tests from my first article. The 26B via llama.cpp + Q3_K_XL vs E4B and E2B via Ollama:

Coding: Compile & Run

Script26B (Q3_K_XL)E4BE2B
FibonacciPASS (49.2 tok/s)PASSPASS
Sieve of EratosthenesPASS (49.0 tok/s)PASSPASS
Debug merge_sortedPASS (48.2 tok/s)PASSPASS
React+Tailwind appPASS (46.4 tok/s)PASSFAIL

Full-Stack App Generation

Check26B (Q3_K_XL)E4B (Ollama)E2B (Ollama)
Complete HTML fileYesYesNo (fragments)
React CDNYesYes
Tailwind CDNYesYes
useState hooksYesYes
Runs in browserYesYes

Agentic Multi-Step (6-step blog platform)

Metric26B (Q3_K_XL)E4BE2B
Steps completed6/66/66/6
Code blocks557
Output length9,263 chars14,562 chars9,258 chars
Speed46.1 tok/s49 tok/s78 tok/s
QualityProduction-gradeGoodGood

Tool Calling

TestSpeedResult
read_file (single tool)51.3 tok/sPerfect finish_reason: "tool_calls"
mkdir + write (multi-step)50.1 tok/sCorrect tool selection
localcoder one-shot49 tok/s325 tokens in 8s

> The Agent: localcoder

With the model running at 49 tok/s with perfect tool calling, I built a Claude Code-style CLI agent around it. 867 lines of Python, uses prompt_toolkit + rich, talks to llama-server via OpenAI-compatible API.

./localcoder # interactive REPL ./localcoder -p "build a react app" # one-shot mode ./localcoder -c # continue last session ./localcoder --yolo # auto-approve everything ./localcoder -m gemma4-e4b # switch model

6 Generic Tools — No Hardcoded Commands

ToolPurpose
bashRun any shell command
write_fileCreate/overwrite files
read_fileRead files
edit_fileFind-and-replace in files
web_searchDuckDuckGo (text + image search)
fetch_urlFetch URLs via Jina Reader

No ADB tools, no screenshot tools, no image tools. The agent figures out adb shell screencap, mDNS discovery, and npm create vite through bash and reasoning alone.

Key Features

  • Smart context management — auto-compresses old messages, summarizes previous turns
  • Loop detection — catches repeated tool calls, injects "try differently" feedback
  • File snapshots — auto-backs up files before edits, /undo to revert
  • Image display — inline terminal images via timg
  • Session persistence — saves/loads conversation history

> What The Agent Built (Autonomously)

Full-Stack React App

"Build a full-stack app: Express API backend + React frontend with Vite"

The agent created a backend directory with Express, CORS, and dotenv, then scaffolded a React frontend with Vite, installed Tailwind and Lucide icons, wrote a 102-line landing page that fetches data from the API, and started both servers. All autonomously.

Sony Bravia ADB Connection

"Show me what is running on my Sony Bravia TV"

The agent ran mDNS discovery, resolved the hostname to an IP, connected via ADB, found the running activity (com.aistudiotv.MainActivity), and took a screenshot of the TV — all through bash and reasoning. No hardcoded ADB tools.

Web Research + Image Search

"Find the Gemma 4 logo online and download it"

Searched DuckDuckGo, fetched the DeepMind page via Jina Reader, extracted the hero image URL, downloaded it with curl, and displayed it inline in the terminal.

> Full Setup from Scratch (5 Minutes)

Step 1: Install Unsloth

# Installs llama-server at ~/.unsloth/llama.cpp/llama-server curl -fsSL https://unsloth.ai/install.sh | sh

Step 2: Download the Model (~12GB)

huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF \ --local-dir ~/models/gemma4-26b \ --include "*UD-Q3_K_XL*"

Step 3: Start the Server (Terminal 1)

~/.unsloth/llama.cpp/llama-server \ -m ~/models/gemma4-26b/gemma-4-26B-A4B-it-UD-Q3_K_XL.gguf \ --port 8089 -ngl 99 -c 16384 -np 1 \ -fa on -ctk q4_0 -ctv q4_0 \ --no-warmup --no-mmproj --cache-ram 0 --jinja

Step 4: Test It

curl -s http://127.0.0.1:8089/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{"model":"gemma4-26b","messages":[{"role":"user","content":"hello"}]}'

Step 5: Run localcoder (Terminal 2)

pip install prompt_toolkit rich brew install timg # optional, for inline images ./localcoder

> How This Actually Works

Dense vs MoE — Why "26B" Fits but "27B" Doesn't

Dense vs MoE architecture comparison

A dense model (Qwen 27B, Gemma 31B) uses ALL parameters per token — 27B weights computing every response. A MoE model (Gemma 26B) has 128 expert networks but only 8 activate per token — 4B doing work. All weights must be in GPU memory, but compute per token is 8x less. Same "size" on paper, completely different GPU reality.

Quantization — Why Q3 Beats Q4

Quantization levels and GPU VRAM usage

Ollama ships Q4_K_M (18GB) — exceeds the ~16GB Metal GPU limit on 24GB Macs. Unsloth's Q3_K_XL uses imatrix calibration: important weights keep higher precision, less critical ones get compressed harder. Result: 12GB at 4.09 bits per weight, nearly Q4 quality, fits with 3GB headroom.

Why Your Mac Gets Hot

Three causes of Mac heat during AI inference

#1 CPU Spill: Model too big for GPU → overflow to CPU → 10-50x slower, much more heat. #2 KV Cache: 256K context = 22GB KV cache on top of model weights. Fix: -ctk q4_0 -ctv q4_0 -c 32768. #3 Dense vs MoE: Dense 31B computes 31B params/token. MoE 26B computes 4B. 8x less work = 8x less heat.

> Key Takeaways

  1. Don't trust "doesn't fit" claims. Q3_K_XL + KV quantization + flash attention turned "impossible" into 49 tok/s on 24GB.
  2. Ollama's conservative quant policy backfires. They only ship Q4_K_M to "protect quality" — but Q4 doesn't fit on 24GB. The result is 2 tok/s with swap thrashing.
  3. The model is great, the toolchain is broken. 3 separate Ollama bugs make Gemma 4 tool calling fail. Always test with llama.cpp before blaming the model.
  4. Unsloth makes this trivial. curl -fsSL https://unsloth.ai/install.sh | sh gives you a pre-built llama-server with all fixes. No cmake, no cherry-picks.
  5. Generic tools beat specialized tools. Six tools (bash, read, write, edit, search, fetch) are more powerful than dozens of hardcoded ones.
  6. MoE models are fast. Only 3.8B parameters active per token — that's why 49 tok/s is possible despite "26B" in the name.

> Updated Scorecard

Original (Ollama)Now (llama.cpp)
26B Speed~2 tok/s49 tok/s
Tool CallingBrokenPerfect
Rating2/108/10
Verdict"Skip on 24GB""My daily driver"

The 26B was never the problem. Ollama was.

██████╗ ███████╗███╗ ███╗��██╗ ███╗ █████╗ ██╗ ██╗ ██╔════╝ ██╔════╝████╗ ████║████╗ ████║██╔══██╗██║ ██║ ██║ ███╗█████╗ ██╔████╔██║██╔████╔██║███████║███████║ ██║ ██║██╔══╝ ██║╚██╔╝██║██║╚██╔╝██║██╔══██║╚════██║ ╚██████╔╝███████╗██║ ╚═╝ ██║██║ ╚═╝ ██║██║ ██║ ██║ ╚═════╝ ╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚═╝ ╚═╝ ╚═╝ 2 / 1 0 → 8 / 1 0 • L O C A L • 4 9 t o k / s