AI Model Rankings

RAW TESTS ONLY | March 2026

Direct model evaluation — no augmentation, no external tools, pure baseline capability

🔬 Test Methodology

What is "RAW" Testing?

RAW tests evaluate models with direct prompts — no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning and planning tasks when given a pure task description.

Test Pipeline

1️⃣ Task Prompt

Send a coding, reasoning, or planning challenge directly to the model with clear requirements

2️⃣ Model Response

The model returns implementation code plus unit tests

3️⃣ Vitest Run

Execute the unit tests to verify correctness

4️⃣ TSC Check

Run the TypeScript compiler to check type safety

5️⃣ Wilson Score

Calculate the adjusted ranking score

How Rankings Are Calculated

⚠️ The 100% Problem

Simple pass rate (passes / total) is misleading:

  • 1 pass, 0 fail = 100% (but only 1 test!)
  • 10 pass, 1 fail = 90.9% (but 11 tests!)

A model that passes 1/1 looks better than one that passes 10/11. That's wrong.

🎯 Solution: Adjusted Wilson Score

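The Wilson score lower bound accounts for both pass rate and sample size. The adjusted score multiplies that bound by the number of tests, so it reads as a conservative estimate of how many tests the model reliably passes:

Wilson = n × (p + z²/(2n) - z√[p(1-p)/n + z²/(4n²)]) / (1 + z²/n)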

Where:

  • p = pass rate (passes / total)
  • n = total tests attempted
  • z = 1.96 (95% confidence)

Result: A model with 10/11 passes gets a higher Wilson score than 1/1 passes. More tests = more confidence = fair ranking.
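
As a rough illustration of the formula, the sketch below computes the adjusted score for the two cases from "The 100% Problem". The function names `wilsonLowerBound` and `adjustedWilson` are chosen here for illustration and are not taken from the benchmark's own code.

```typescript
// Lower bound of the Wilson score interval for a pass rate of passes/total.
// z = 1.96 corresponds to 95% confidence, as stated above.
function wilsonLowerBound(passes: number, total: number, z = 1.96): number {
  if (total === 0) return 0; // assumption: no tests means no evidence
  const p = passes / total;
  const z2 = z * z;
  const centre = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return (centre - margin) / (1 + z2 / total);
}

// Adjusted score: the lower bound scaled by the number of tests attempted.
function adjustedWilson(passes: number, total: number, z = 1.96): number {
  return wilsonLowerBound(passes, total, z) * total;
}

console.log(adjustedWilson(1, 1).toFixed(2));   // 0.21
console.log(adjustedWilson(10, 11).toFixed(2)); // 6.85
```

The 1/1 result keeps its 100% pass rate but scores far lower, because a single test gives a very wide confidence interval.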

Score Components

Metric     Description
Pass       All unit tests passed
Fail       Tests ran but failed (logic/implementation errors)
Wilson     Adjusted Wilson Score (primary ranking)
Pass Rate  For reference only (not used for ranking)

šŸ† Overall Rankings (All Categories Combined)

Ranked by Adjusted Wilson Score — higher is better

Rank  Model                 Pass  Fail  Wilson  Pass Rate
1     deepseek-v3.2         47    6     41.03   88.7%
2     qwen3-coder-next      43    9     36.54   82.7%
3     grok-4                40    3     35.00   93.0%
4     minimax-m2.5          41    8     34.77   83.7%
5     glm-4.7               39    10    32.52   79.6%
6     glm-5                 34    1     29.91   97.1%
7     qwen3-coder           31    3     26.19   91.2%
8     haiku-4.5             27    1     23.04   96.4%
9     gemini-2.5-pro        26    0     22.65   100%
10    grok-4.1-fast-reason  23    5     18.03   82.1%
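
The published scores can be cross-checked against the formula. The sketch below copies the pass and fail counts from the table, recomputes the adjusted Wilson score, and also sorts by raw pass rate to show how that ordering would differ. The helper name `adjustedWilsonScore` and the row layout are illustrative assumptions, not the site's actual code.

```typescript
// Rows copied from the Overall Rankings table:
// [model, passes, failures, published Wilson score].
const rows: Array<[string, number, number, number]> = [
  ["deepseek-v3.2", 47, 6, 41.03],
  ["qwen3-coder-next", 43, 9, 36.54],
  ["grok-4", 40, 3, 35.0],
  ["minimax-m2.5", 41, 8, 34.77],
  ["glm-4.7", 39, 10, 32.52],
  ["glm-5", 34, 1, 29.91],
  ["qwen3-coder", 31, 3, 26.19],
  ["haiku-4.5", 27, 1, 23.04],
  ["gemini-2.5-pro", 26, 0, 22.65],
  ["grok-4.1-fast-reason", 23, 5, 18.03],
];

// Wilson lower bound (z = 1.96) multiplied by the number of tests.
function adjustedWilsonScore(passes: number, failures: number, z = 1.96): number {
  const n = passes + failures;
  if (n === 0) return 0;
  const p = passes / n;
  const z2 = z * z;
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return (n * (p + z2 / (2 * n) - margin)) / (1 + z2 / n);
}

const recomputed = rows.map(([model, passes, failures, published]) => ({
  model,
  published,
  score: adjustedWilsonScore(passes, failures),
  passRate: passes / (passes + failures),
}));

for (const r of recomputed) {
  console.log(`${r.model}: recomputed ${r.score.toFixed(2)}, published ${r.published.toFixed(2)}`);
}

// Ranking by raw pass rate instead would put the smallest perfect sample first.
const byPassRate = [...recomputed].sort((a, b) => b.passRate - a.passRate);
console.log(byPassRate[0].model); // gemini-2.5-pro
```

Ranked by pass rate alone, gemini-2.5-pro (26/26) would lead the table. The adjusted score places it ninth because it was evaluated on fewer tests than the models above it.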

📋 Test Tasks by Category

TypeScript

async-retry: Exponential backoff with jitter, configurable retry predicate
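
To give a sense of what this task asks for, here is a minimal sketch, not the benchmark's reference solution. It assumes "full jitter" (a random delay between zero and the exponential ceiling), and the names `RetryOptions`, `backoffDelay` and `retry` are hypothetical.

```typescript
interface RetryOptions {
  maxAttempts: number;   // total tries, including the first call
  baseDelayMs: number;   // delay ceiling before the first retry
  maxDelayMs: number;    // upper limit for the exponential ceiling
  shouldRetry?: (error: unknown, attempt: number) => boolean; // retry predicate
  random?: () => number; // injectable randomness, defaults to Math.random
  sleep?: (ms: number) => Promise<void>; // injectable timer
}

// Full jitter: pick a delay in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return random() * ceiling;
}

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const shouldRetry = opts.shouldRetry ?? (() => true);
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const isLastAttempt = attempt + 1 >= opts.maxAttempts;
      // Give up when attempts are exhausted or the predicate rejects the error.
      if (isLastAttempt || !shouldRetry(error, attempt)) throw error;
      await sleep(backoffDelay(attempt, opts.baseDelayMs, opts.maxDelayMs, opts.random));
    }
  }
}
```

Injecting `random` and `sleep` keeps the function deterministic under test, which matters when the unit tests are part of what is being scored.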

typed-emitter: Generic type-safe event emitter
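
One possible shape for such an emitter is sketched below, under the assumption that events are described by a map from event name to payload type. The class name `TypedEmitter` and the `AppEvents` map are hypothetical and only illustrate the idea.

```typescript
type Listener<T> = (payload: T) => void;

class TypedEmitter<Events> {
  // One listener set per event name, typed by that event's payload.
  private listeners: { [K in keyof Events]?: Set<Listener<Events[K]>> } = {};

  // Registers a listener and returns a function that removes it again.
  on<K extends keyof Events>(event: K, listener: Listener<Events[K]>): () => void {
    const set = this.listeners[event] ?? new Set<Listener<Events[K]>>();
    this.listeners[event] = set;
    set.add(listener);
    return () => {
      set.delete(listener);
    };
  }

  // Calls every listener registered for the event with the given payload.
  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    this.listeners[event]?.forEach((listener) => listener(payload));
  }
}

type AppEvents = { login: { user: string }; tick: number };

const emitter = new TypedEmitter<AppEvents>();
const unsubscribe = emitter.on("tick", (n) => console.log(n + 1));
emitter.emit("tick", 41); // prints 42
unsubscribe();
emitter.emit("tick", 1); // no listener left, prints nothing
// emitter.emit("tick", "oops"); // rejected by the compiler: string is not number
```

The type safety comes from tying the payload parameter to `Events[K]`, so a wrong event name or payload shape fails at the TSC step before any test runs.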

rate-limiter: Token bucket algorithm
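
For readers unfamiliar with the algorithm, a token bucket can be sketched as follows. This is an illustrative version with a hypothetical `TokenBucket` class and an injectable clock, not the solution any ranked model produced.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = now();
  }

  // Takes `count` tokens if available; returns false when the caller is rate limited.
  tryRemove(count = 1): boolean {
    this.refill();
    if (count > this.tokens) return false;
    this.tokens -= count;
    return true;
  }

  // Adds tokens in proportion to elapsed time, never exceeding capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }
}

// Usage with a fake clock so the behaviour is deterministic.
let fakeNowMs = 0;
const bucket = new TokenBucket(2, 1, () => fakeNowMs);
console.log(bucket.tryRemove()); // true
console.log(bucket.tryRemove()); // true
console.log(bucket.tryRemove()); // false, bucket is empty
fakeNowMs = 1000;                // one second later, one token has been refilled
console.log(bucket.tryRemove()); // true
```

The bucket allows short bursts up to `capacity` while holding the long-run rate at `refillPerSecond`.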

🦀 Rust

builder-pattern: Builder creational pattern

channel-mpmc: Multi-producer, multi-consumer channel

functional-pipeline: Iterator combinators

generic-cache: Generic cache implementation

state-machine: State machine pattern

🧠 Reasoning

Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets

✍️ Non-Coding

Writing tasks, analysis, structured output, JSON extraction

📖 Abbreviations

Abbreviation  Meaning
RAW           Direct model calls without any augmentation
VITEST        JavaScript/TypeScript test runner
TSC           TypeScript Compiler (type checking)
Wilson        Adjusted Wilson Score (confidence-weighted ranking)
TS            TypeScript
Pass Rate     passes / (passes + failures), for reference only

🤖 Model Rankings | High Tech Mind | March 2026

© 2026 High Tech Mind B.V. All rights reserved.

KvK: 97769894 | BTW: NL868223529B01

🔗 Open Brain GitHub Repository