AI Model Rankings
RAW TESTS ONLY March 2026
Direct model evaluation: no augmentation, no external tools, pure baseline capability
Test Methodology
What is "RAW" Testing?
RAW tests evaluate models with direct prompts: no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning, and planning tasks when given only a plain task description.
Test Pipeline
1. Task Prompt
Send a coding, reasoning or planning challenge directly to the model with clear requirements
2. Model Response
Model returns implementation code + unit tests
3. Vitest Run
Execute unit tests to verify correctness
4. TSC Check
TypeScript Compiler checks type safety
5. Wilson Score
Calculate adjusted ranking score
How Rankings Are Calculated
The 100% Problem
Simple pass rate (passes / total) is misleading:
- 1 pass, 0 fail = 100% (but only 1 test!)
- 10 pass, 1 fail = 90.9% (but 11 tests!)
A model that passes 1/1 looks better than one that passes 10/11. That's wrong.
Solution: Adjusted Wilson Score
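The Wilson score lower bound accounts for both pass rate and sample size. In its standard form:

$$
\text{lower bound} = \frac{p + \dfrac{z^2}{2n} - z\sqrt{\dfrac{p(1-p)}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}
$$

Where:
- p = pass rate (passes / total)
- n = total tests attempted
- z = 1.96 (95% confidence)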
Result: a model with 10/11 passes gets a higher Wilson score than one with 1/1. More tests mean more confidence and a fairer ranking.
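The calculation can be sketched in a few lines of TypeScript. This is a minimal illustration, not the benchmark's own code: the function names `wilsonLowerBound` and `adjustedWilson` are hypothetical, and the scaling by the number of tests is an assumption inferred from the published table (multiplying the standard lower bound by n appears to reproduce values such as 41.03 for 47 passes and 6 fails, and 22.65 for 26 passes and 0 fails).

```typescript
// Sketch only: names are hypothetical, not taken from the benchmark's code.

/** Standard Wilson score lower bound for `passes` successes out of `total` trials. */
function wilsonLowerBound(passes: number, total: number, z: number = 1.96): number {
  if (total <= 0) return 0; // no tests attempted, so no evidence of any pass rate
  const p = passes / total;
  const z2 = z * z;
  const centre = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return (centre - margin) / (1 + z2 / total);
}

/**
 * ASSUMPTION: the "Adjusted" score is the lower bound scaled by the number of
 * tests attempted. This is inferred from the rankings table, not documented.
 */
function adjustedWilson(passes: number, fails: number, z: number = 1.96): number {
  const total = passes + fails;
  return total * wilsonLowerBound(passes, total, z);
}

console.log(wilsonLowerBound(1, 1).toFixed(3));   // "0.207" for 1/1
console.log(wilsonLowerBound(10, 11).toFixed(3)); // "0.623" for 10/11
console.log(adjustedWilson(47, 6).toFixed(2));    // "41.03", matching the deepseek-v3.2 row
```

The 1/1 model has a 100% pass rate but a lower bound of only about 0.21, while the 10/11 model lands at about 0.62, which is the ordering the ranking is meant to produce.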
Score Components
| Metric | Description |
|---|---|
| Pass | All unit tests passed |
| Fail | Tests ran but failed (logic/implementation errors) |
| Wilson | Adjusted Wilson Score (primary ranking) |
| Pass Rate | For reference only (not used for ranking) |
Overall Rankings (All Categories Combined)
Ranked by Adjusted Wilson Score; higher is better
| Rank | Model | Pass | Fail | Wilson | Rate |
|---|---|---|---|---|---|
| 1 | deepseek-v3.2 | 47 | 6 | 41.03 | 88.7% |
| 2 | qwen3-coder-next | 43 | 9 | 36.54 | 82.7% |
| 3 | grok-4 | 40 | 3 | 35.00 | 93.0% |
| 4 | minimax-m2.5 | 41 | 8 | 34.77 | 83.7% |
| 5 | glm-4.7 | 39 | 10 | 32.52 | 79.6% |
| 6 | glm-5 | 34 | 1 | 29.91 | 97.1% |
| 7 | qwen3-coder | 31 | 3 | 26.19 | 91.2% |
| 8 | haiku-4.5 | 27 | 1 | 23.04 | 96.4% |
| 9 | gemini-2.5-pro | 26 | 0 | 22.65 | 100% |
| 10 | grok-4.1-fast-reason | 23 | 5 | 18.03 | 82.1% |
Rust Rankings
TypeScript Rankings
Reasoning Rankings
Non-Coding Rankings
Test Tasks by Category
TypeScript
async-retry: Exponential backoff with jitter, configurable retry predicate
typed-emitter: Generic type-safe event emitter
rate-limiter: Token bucket algorithm
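To give a sense of what a task like rate-limiter asks for, here is a minimal token bucket sketch. The actual task prompt and required interface are not shown on this page, so the class name `TokenBucket`, the method `tryRemove`, and the injectable `now` clock are all hypothetical choices made for illustration.

```typescript
// Hypothetical sketch of a token bucket; not the benchmark's reference solution.
class TokenBucket {
  private readonly capacity: number;        // maximum burst size
  private readonly refillPerSecond: number; // steady-state refill rate
  private readonly now: () => number;       // clock in milliseconds (injectable for tests)
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = now();
  }

  /** Try to spend `cost` tokens; returns true if the request is allowed. */
  tryRemove(cost: number = 1): boolean {
    this.refill();
    if (cost > this.tokens) return false;
    this.tokens -= cost;
    return true;
  }

  /** Add tokens for the time elapsed since the last refill, capped at capacity. */
  private refill(): void {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }
}

// Usage with a fake clock so the behaviour is deterministic.
let fakeClockMs = 0;
const bucket = new TokenBucket(2, 1, () => fakeClockMs);
console.log(bucket.tryRemove()); // true
console.log(bucket.tryRemove()); // true
console.log(bucket.tryRemove()); // false, bucket is empty
fakeClockMs += 1000;             // one second later, one token has been refilled
console.log(bucket.tryRemove()); // true
```

Passing the clock in as a function keeps the class testable without real timers, which suits a pipeline that grades solutions by running unit tests.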
Rust
builder-pattern: Builder creational pattern
channel-mpmc: Multi-producer multi-consumer
functional-pipeline: Iterator combinators
generic-cache: Generic cache implementation
state-machine: State machine pattern
Reasoning
Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets
Non-Coding
Writing tasks, analysis, structured output, JSON extraction
TypeScript Tasks: Detailed
Rust Tasks: Detailed
Reasoning Tasks: Detailed
Non-Coding Tasks: Detailed
Abbreviations
| Abbreviation | Meaning |
|---|---|
| RAW | Direct model calls without any augmentation |
| VITEST | JavaScript/TypeScript test runner |
| TSC | TypeScript Compiler (type checking) |
| Wilson | Adjusted Wilson Score (confidence-weighted ranking) |
| TS | TypeScript |
| Pass Rate | passes / (passes + failures), for reference only |