AI Model Rankings
RAW TESTS ONLY March 14, 2026
Direct model evaluation: no augmentation, no external tools, pure baseline capability
Test Methodology
What is "RAW" Testing?
RAW tests evaluate models with direct prompts: no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning, and planning tasks when given only a task description.
Test Pipeline
1. Task Prompt
Send a coding, reasoning or planning challenge directly to the model with clear requirements
2. Model Response
Model returns implementation code + unit tests
3. Vitest Run
Execute unit tests to verify correctness
4. TSC Check
TypeScript Compiler checks type safety
5. Wilson Score
Calculate adjusted ranking score
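The five steps above can be sketched as a small harness. This is only an illustration under stated assumptions: every type and function name below is hypothetical, the steps are injected as plain synchronous functions so the flow is easy to follow (real model calls and test runs would be asynchronous), and the verdict is assumed to follow the unit tests alone, since the effect of the TSC check on scoring is not stated here.

```typescript
// Hypothetical names throughout: none of these come from the benchmark's own code.
type Verdict = "pass" | "fail";

interface PipelineSteps {
  // Steps 1-2: send the task prompt, get back implementation code plus unit tests.
  askModel: (taskPrompt: string) => string;
  // Step 3: run the returned unit tests (e.g. with Vitest); true when every test passed.
  runUnitTests: (modelResponse: string) => boolean;
  // Step 4: run the TypeScript compiler; true when type checking reported no errors.
  typeCheck: (modelResponse: string) => boolean;
}

interface RunRecord {
  task: string;
  verdict: Verdict;
  typeSafe: boolean;
}

function runRawTest(task: string, taskPrompt: string, steps: PipelineSteps): RunRecord {
  const response = steps.askModel(taskPrompt);
  const testsPassed = steps.runUnitTests(response);
  const typeSafe = steps.typeCheck(response);
  // Assumption: the verdict follows the unit tests only ("Pass" = all unit tests passed).
  // The type-check result is recorded separately rather than folded into the verdict.
  return { task, verdict: testsPassed ? "pass" : "fail", typeSafe };
}

// Input for step 5: count passes and fails so a Wilson score can be computed from them.
function tally(records: RunRecord[]): { passes: number; fails: number } {
  const passes = records.filter((r) => r.verdict === "pass").length;
  return { passes, fails: records.length - passes };
}

// Usage with stubbed steps (no real model, test runner, or compiler involved).
const stubSteps: PipelineSteps = {
  askModel: (prompt) => `// generated code for: ${prompt}`,
  runUnitTests: (response) => response.includes("token bucket"),
  typeCheck: () => true,
};

const records = [
  runRawTest("rate-limiter", "Implement a token bucket", stubSteps),
  runRawTest("typed-emitter", "Implement a typed event emitter", stubSteps),
];
console.log(tally(records)); // { passes: 1, fails: 1 }
```

Injecting the steps keeps the sketch independent of any particular model API or test-runner invocation, which are not described in this document.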
How Rankings Are Calculated
The 100% Problem
Simple pass rate (passes / total) is misleading:
- 1 pass, 0 fail = 100% (but only 1 test!)
- 10 pass, 1 fail = 90.9% (but 11 tests!)
A model that passes 1/1 looks better than one that passes 10/11. That's wrong.
Solution: Adjusted Wilson Score
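The Wilson score lower bound accounts for both pass rate and sample size:

$$\text{lower bound} = \frac{p + \dfrac{z^2}{2n} - z\sqrt{\dfrac{p(1-p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$

Where:
- p = pass rate (passes / total)
- n = total tests attempted
- z = 1.96 (95% confidence)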
Result: A model that passes 10/11 gets a higher Wilson score than one that passes 1/1. More tests = more confidence = fairer ranking.
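As a worked check of the 1/1 versus 10/11 comparison, here is a minimal TypeScript sketch of the standard Wilson score lower bound. The function names are illustrative, not the benchmark's own. The `adjustedWilson` helper rests on an assumption: the exact adjustment is not spelled out in this text, but the scores published in the ranking tables are consistent with the lower bound multiplied by the total number of tests (171 passes and 1 failure give about 166.46, matching the table).

```typescript
// Standard Wilson score lower bound for a pass rate; z = 1.96 is 95% confidence.
function wilsonLowerBound(passes: number, total: number, z: number = 1.96): number {
  if (total === 0) return 0; // no tests attempted, so no evidence either way
  const p = passes / total;
  const z2 = z * z;
  const centre = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return (centre - margin) / (1 + z2 / total);
}

// Assumption: "adjusted" means lower bound times total tests.
// This is inferred from the published scores, not from a stated definition.
function adjustedWilson(passes: number, fails: number, z: number = 1.96): number {
  const total = passes + fails;
  return total * wilsonLowerBound(passes, total, z);
}

console.log(wilsonLowerBound(1, 1).toFixed(3));   // 0.207
console.log(wilsonLowerBound(10, 11).toFixed(3)); // 0.623
console.log(adjustedWilson(171, 1).toFixed(2));   // 166.46
```

With a single test the lower bound stays near 0.21 even at a 100% pass rate, while 10/11 reaches about 0.62, so the model with more evidence ranks higher.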
Score Components
| Metric | Description |
|---|---|
| Pass | All unit tests passed |
| Fail | Tests ran but failed (logic/implementation errors) |
| Wilson | Adjusted Wilson Score (primary ranking) |
| Pass Rate | For reference only (not used for ranking) |
Overall Rankings (All Categories Combined)
Ranked by Adjusted Wilson Score (higher is better)
| Rank | Model | Pass | Fail | Wilson | Rate |
|---|---|---|---|---|---|
| 1 | opus-4.7 NEW | 171 | 1 | 166.46 | 99% |
| 2 | glm-5.1 | 134 | 1 | 129.50 | 99% |
| 3 | opus-4.5 | 130 | 2 | 124.93 | 98% |
| 4 | opus-4.6 | 119 | 1 | 114.52 | 99% |
| 5 | gpt-5.4-mini | 119 | 10 | 111.35 | 92% |
| 6 | haiku-4.5 | 110 | 6 | 103.44 | 95% |
| 7 | deepseek-v4-pro NEW | 100 | 1 | 95.55 | 99% |
| 8 | gpt-5.3 | 99 | 1 | 94.55 | 99% |
| 9 | deepseek-v4-flash NEW | 99 | 4 | 93.15 | 96% |
| 10 | qwen3-coder | 99 | 8 | 91.95 | 93% |
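The table shows that ranking by Wilson score and ranking by pass rate can disagree: opus-4.5 (98%) sits above opus-4.6 (99%) because it has more tests behind it. The sketch below ranks three rows from the table both ways. The type and function names are hypothetical, and the score is assumed to be the Wilson lower bound multiplied by the number of tests, which reproduces the published Wilson column for these rows.

```typescript
interface ModelResult {
  model: string;
  pass: number;
  fail: number;
}

interface ScoredResult extends ModelResult {
  wilson: number;   // assumed: Wilson lower bound * total tests
  passRate: number; // passes / (passes + failures), for reference only
}

function scoreResult(r: ModelResult, z: number = 1.96): ScoredResult {
  const n = r.pass + r.fail;
  if (n === 0) return { ...r, wilson: 0, passRate: 0 };
  const p = r.pass / n;
  const z2 = z * z;
  const lower =
    (p + z2 / (2 * n) - z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) /
    (1 + z2 / n);
  return { ...r, wilson: n * lower, passRate: p };
}

// Sort a copy of the scored rows by the chosen key, highest first.
function rankBy(results: ModelResult[], key: "wilson" | "passRate"): ScoredResult[] {
  return results.map((r) => scoreResult(r)).sort((a, b) => b[key] - a[key]);
}

// Pass and fail counts copied from the overall rankings table.
const rows: ModelResult[] = [
  { model: "opus-4.6", pass: 119, fail: 1 },
  { model: "gpt-5.4-mini", pass: 119, fail: 10 },
  { model: "opus-4.5", pass: 130, fail: 2 },
];

console.log(rankBy(rows, "wilson").map((r) => r.model));
// [ 'opus-4.5', 'opus-4.6', 'gpt-5.4-mini' ]
console.log(rankBy(rows, "passRate").map((r) => r.model));
// [ 'opus-4.6', 'opus-4.5', 'gpt-5.4-mini' ]
```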
Rust Rankings
Ranked by Adjusted Wilson Score (raw tests only, 9 tasks)
| Rank | Model | Pass | Fail | Wilson |
|---|---|---|---|---|
| 1 | opus-4.7 NEW | 106 | 0 | 102.29 |
| 2 | opus-4.5 | 96 | 1 | 91.56 |
| 3 | glm-5.1 | 93 | 0 | 89.31 |
| 4 | opus-4.6 | 86 | 0 | 82.32 |
| 5 | gpt-5.3 | 77 | 1 | 72.61 |
| 6 | gpt-5.4-mini | 72 | 0 | 68.35 |
| 7 | gemini-3-pro | 63 | 0 | 59.38 |
| 8 | gemini-3.1-pro | 60 | 0 | 56.39 |
| 9 | deepseek-v4-pro NEW | 58 | 1 | 53.69 |
| 10 | gpt-5.4 | 56 | 0 | 52.41 |
TypeScript Rankings
Ranked by Adjusted Wilson Score (raw tests only, 3 tasks)
| Rank | Model | Pass | Fail | Wilson |
|---|---|---|---|---|
| 1 | minimax-m2.5 | 52 | 9 | 45.31 |
| 2 | mimo-v2.5 NEW | 41 | 2 | 36.35 |
| 3 | opus-4.5 | 34 | 1 | 29.91 |
| 4 | grok-4.1-fast | 35 | 4 | 29.80 |
| 5 | opus-4.6 | 33 | 1 | 28.93 |
| 6 | qwen3-coder | 33 | 3 | 28.14 |
| 7 | opus-4.7 NEW | 31 | 0 | 27.58 |
| 8 | haiku-4.5 | 32 | 3 | 27.17 |
| 9 | grok-4 | 31 | 1 | 26.96 |
| 10 | mimo-v2.5-pro NEW | 27 | 0 | 23.64 |
Reasoning Rankings
Non-Coding Rankings
Test Tasks by Category
TypeScript
async-retry: Exponential backoff with jitter, configurable retry predicate
typed-emitter: Generic type-safe event emitter
rate-limiter: Token bucket algorithm
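To make the rate-limiter task concrete, here is a minimal token bucket sketch. The full task specification is not reproduced in this document, so the class name, method names, and the injectable clock are all assumptions made for illustration; this is not the benchmark's reference solution.

```typescript
// Minimal token bucket: holds up to `capacity` tokens, refilled continuously
// at `refillPerSecond`. The clock is injected so behaviour is deterministic in tests.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number; // returns milliseconds
  private tokens: number;
  private lastRefill: number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefill = now();
  }

  // Take `cost` tokens if available; returns false (and takes nothing) otherwise.
  tryAcquire(cost: number = 1): boolean {
    this.refill();
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
  }
}

// Usage with a fake clock: 2-token burst, 1 token refilled per second.
let fakeTime = 0;
const bucket = new TokenBucket(2, 1, () => fakeTime);
console.log(bucket.tryAcquire(), bucket.tryAcquire(), bucket.tryAcquire()); // true true false
fakeTime = 1000; // one second later, one token has been refilled
console.log(bucket.tryAcquire()); // true
```

The capacity bounds the burst size while the refill rate bounds the sustained rate, which is the property that distinguishes a token bucket from a fixed-window counter.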
Rust
builder-pattern: Builder creational pattern
channel-mpmc: Multi-producer multi-consumer
functional-pipeline: Iterator combinators
generic-cache: Generic cache implementation
state-machine: State machine pattern
Reasoning
Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets
Non-Coding
Writing tasks, analysis, structured output, JSON extraction
TypeScript Tasks (Detailed)
Rust Tasks (Detailed)
Reasoning Tasks (Detailed)
Non-Coding Tasks (Detailed)
Abbreviations
| Abbreviation | Meaning |
|---|---|
| RAW | Direct model calls without any augmentation |
| VITEST | JavaScript/TypeScript test runner |
| TSC | TypeScript Compiler (type checking) |
| Wilson | Adjusted Wilson Score (confidence-weighted ranking) |
| TS | TypeScript |
| Pass Rate | passes / (passes + failures), for reference only |