I Tested 6 Local Coder LLMs on Real Apps. The Leaderboard Surprised Me.
Six models, five apps, one honest leaderboard. qwen3-coder isn't the king anymore.
A three-part series on benchmarking local coder LLMs with real apps, Playwright validation, and Opus pairwise judging: the leaderboard results, the debugging story, and the reusable methodology.
Part 1: Six models, five apps, one honest leaderboard. qwen3-coder isn't the king anymore.
Part 2: Two overnight runs. Two rankings. Neither was real. Here's how my benchmark lied, and what fixed it.
Part 3: Seven rules from a bench that lied twice before it told the truth. Reusable beyond LLMs.
I wanted to know which local coder LLM I should actually use day to day. Six models, seven benchmarks, one honest leaderboard. Getting honest results took a week of debugging my own tooling.
Everything is open source: github.com/NightOwlCoder/local-llm-bench
Add your models to bench.config.json, run ./bench.sh, get comparable scores.
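As a rough sketch of what that config might look like, here is a minimal `bench.config.json` entry. The field names (`models`, `name`, `endpoint`) are assumptions for illustration, not the repo's actual schema; check the repository's own `bench.config.json` for the real keys.

```json
{
  "models": [
    { "name": "qwen3-coder", "endpoint": "http://localhost:11434" },
    { "name": "your-model-here", "endpoint": "http://localhost:11434" }
  ]
}
```

With the models registered, `./bench.sh` runs each one against the same app suite, so the scores land on a comparable scale.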