nightowlcoder@home:~$

Benchmarking Local LLMs Series

A three-part series on benchmarking local coder LLMs with real apps, Playwright validation, and Opus pairwise judging: from the leaderboard results, through the debugging story, to the reusable methodology.

✅ Complete (3 parts)
Part 1

I Tested 6 Local Coder LLMs on Real Apps. The Leaderboard Surprised Me.

Six models, five apps, one honest leaderboard. qwen3-coder isn't the king anymore.

Part 2

The Week My LLM Benchmark Was Lying to Me

Two overnight runs. Two rankings. Neither was real. Here's how my benchmark lied and what fixed it.

Part 3

How to Benchmark Local LLMs Honestly

Seven rules from a bench that lied twice before it told the truth. Reusable beyond LLMs.

About This Series

I wanted to know which local coder LLM to actually use daily. Six models, seven benchmarks, one honest leaderboard. Getting there took a week of debugging my own tooling.

What You’ll Learn

Which of six local coder LLMs holds up on real apps, how a benchmark can quietly lie to you, and seven rules for building a bench that tells the truth.

The Framework

Everything is open source: github.com/NightOwlCoder/local-llm-bench

Add your models to bench.config.json, run ./bench.sh, get comparable scores.
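As a minimal sketch of that workflow: the config fields below and the Ollama-style localhost endpoint are assumptions for illustration, not the repo's documented schema, so check the repo's README for the real format.

```sh
# Hypothetical example — the field names ("models", "name", "endpoint")
# and the Ollama-style port are assumptions, not the repo's schema.
cat > bench.config.json <<'EOF'
{
  "models": [
    { "name": "qwen3-coder", "endpoint": "http://localhost:11434" },
    { "name": "your-model",  "endpoint": "http://localhost:8080" }
  ]
}
EOF

./bench.sh    # runs the benchmarks and prints comparable scores
```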