nightowlcoder@home:~$

Benchmarking Local LLMs Series

A three-part series on benchmarking local coder LLMs with real apps, Playwright validation, and Opus pairwise judging: from the leaderboard results, through the debugging story, to the reusable methodology.

✅ Complete (3 parts)
Part 1

I Tested 6 Local Coder LLMs on Real Apps. The Leaderboard Surprised Me.

Six models, five apps, one honest leaderboard. qwen3-coder isn't the king anymore.

Part 2

The Week My LLM Benchmark Was Lying to Me

Two overnight runs. Two rankings. Neither was real. Here's how my benchmark lied and what fixed it.

Part 3

How to Benchmark Local LLMs Honestly

Seven rules from a bench that lied twice before it told the truth. Reusable beyond LLMs.

About This Series

I wanted to know which local coder LLM to actually use daily. Six models, seven benchmarks, one honest leaderboard. Getting there took a week of debugging my own tooling.

What You’ll Learn

Which of six local coder LLMs holds up on real apps, how a benchmark can quietly lie to you, and seven rules for building a bench that tells the truth.

The Framework

Everything is open source: github.com/NightOwlCoder/local-llm-bench

Add your models to bench.config.json, run ./bench.sh, get comparable scores.
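As a minimal sketch of that workflow: the config fields below and the Ollama-style localhost endpoint are assumptions for illustration, not the repo's documented schema, so check the repo's README for the real format.

```sh
# Hypothetical example — the field names ("models", "name", "endpoint")
# and the Ollama-style port are assumptions, not the repo's schema.
cat > bench.config.json <<'EOF'
{
  "models": [
    { "name": "qwen3-coder", "endpoint": "http://localhost:11434" },
    { "name": "your-model",  "endpoint": "http://localhost:8080" }
  ]
}
EOF

./bench.sh    # runs the benchmarks and prints comparable scores
```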