AI Model Rankings — Wilson Score Benchmarks for Coding, Reasoning & Planning

🔬 Test Methodology

What is "RAW" Testing?

RAW tests evaluate models with direct prompts — no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning and planning tasks when given a pure task description.

Test Pipeline

1️⃣ Task Prompt

Send a coding, reasoning or planning challenge directly to the model with clear requirements

2️⃣ Model Response

Model returns implementation code + unit tests

3️⃣ Vitest Run

Execute unit tests to verify correctness

4️⃣ TSC Check

TypeScript Compiler checks type safety

5️⃣ Wilson Score

Calculate adjusted ranking score

📐 How Rankings Are Calculated

⚠️ The 100% Problem

Simple pass rate (passes / total) is misleading:

1 pass, 0 fail = 100% (but only 1 test!)
10 pass, 1 fail = 90.9% (but 11 tests!)

A model that passes 1/1 looks better than one that passes 10/11. That's wrong.

🎯 Solution: Adjusted Wilson Score

The adjusted score is the Wilson score lower bound multiplied by the number of tests, so it accounts for both pass rate AND sample size:

Wilson = n × (p + z²/(2n) - z√[p(1-p)/n + z²/(4n²)]) / (1 + z²/n)

Where:

p = pass rate (passes / total)
n = total tests attempted
z = 1.96 (95% confidence)

Result: At z = 1.96, a model with 10/11 passes scores 6.85, while a model with 1/1 passes scores only 0.21, so the model with more evidence ranks higher. More tests = more confidence = fair ranking.
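
The calculation can be sketched in TypeScript as follows. This is a minimal illustration under the definitions above, not the benchmark's actual scoring code; the function names `wilsonLowerBound` and `adjustedWilson` are assumptions chosen for this example.

```typescript
// Wilson score lower bound for a pass rate, at confidence level z.
// Hypothetical helper: the name and signature are illustrative only.
function wilsonLowerBound(passes: number, total: number, z = 1.96): number {
  if (total === 0) return 0; // no tests attempted means no evidence
  const p = passes / total;
  const z2 = z * z;
  const centre = p + z2 / (2 * total);
  const margin = z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total));
  return (centre - margin) / (1 + z2 / total);
}

// Adjusted score: the lower bound scaled by the number of tests attempted.
function adjustedWilson(passes: number, fails: number, z = 1.96): number {
  const total = passes + fails;
  return wilsonLowerBound(passes, total, z) * total;
}

console.log(adjustedWilson(1, 0).toFixed(2));   // 0.21
console.log(adjustedWilson(10, 1).toFixed(2));  // 6.85
console.log(adjustedWilson(171, 1).toFixed(2)); // 166.46
```

Because the lower bound is multiplied by n, the adjusted score grows with the number of tests a model has completed as well as with its pass rate.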

Score Components

Metric	Description
Pass	All unit tests passed
Fail	Tests ran but failed (logic/implementation errors)
Wilson	Adjusted Wilson Score (primary ranking)
Pass Rate	For reference only (not used for ranking)

🏆 Overall Rankings (All Categories Combined)

Ranked by Adjusted Wilson Score — higher is better

Rank	Model	Pass	Fail	Wilson	Rate
1	opus-4.7 NEW	171	1	166.46	99%
2	glm-5.1	134	1	129.50	99%
3	opus-4.5	130	2	124.93	98%
4	opus-4.6	119	1	114.52	99%
5	gpt-5.4-mini	119	10	111.35	92%
6	haiku-4.5	110	6	103.44	95%
7	deepseek-v4-pro NEW	100	1	95.55	99%
8	gpt-5.3	99	1	94.55	99%
9	deepseek-v4-flash NEW	99	4	93.15	96%
10	qwen3-coder	99	8	91.95	93%
11	minimax-m2.5	97	10	89.50	91%
12	gemini-3-flash	90	3	84.57	97%
13	mimo-v2.5 NEW	89	7	82.28	93%
14	hunter	88	8	81.03	92%
15	mimo-v2-pro	86	10	78.60	90%
16	healer	85	7	78.31	92%
17	qwen3.6-flash NEW	82	3	76.61	96%
18	glm-5	81	2	76.06	98%
19	qwen3.6-plus	80	1	75.60	99%
19	gpt-5-mini	80	1	75.60	99%
21	glm-4.7	80	5	73.91	94%
22	mimo-v2.5-pro NEW	79	4	73.25	95%
23	gemini-3.1-pro	76	0	72.34	100%
24	mimo-v2-flash	78	7	71.37	92%
25	deepseek-v3.2	78	9	70.90	90%
26	gpt-5.2	74	0	70.35	100%
27	gemini-3-pro	72	0	68.35	100%
27	gpt-5.4	72	0	68.35	100%
29	grok-code-fast-1	72	2	67.10	97%
30	gpt-5.2-codex	70	0	66.36	100%
31	deepseek-v3.2-exp	72	4	66.30	95%
32	deepseek-v3	72	7	65.43	91%
33	qwen3.6-max NEW	67	1	62.65	99%
34	qwen3-coder-next	67	12	59.49	85%
35	mimo-v2-omni	66	17	57.79	80%
36	command-a	65	13	57.36	83%
37	codestral	63	12	55.56	84%
38	gpt-5.4-nano	64	20	55.49	76%
39	gemini-2.5-pro	59	0	55.39	100%
40	ling-2.6-1t NEW	61	5	55.08	92%
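
The table gives tied models the same rank and then skips the following position (19, 19, 21). A minimal sketch of that ordering is shown below, assuming ties are detected by comparing the published two-decimal scores; the `ScoreRow` type and `rankByWilson` function are hypothetical names, not part of the benchmark.

```typescript
interface ScoreRow {
  model: string;
  wilson: number; // adjusted Wilson score as published (two decimals)
}

interface RankedRow extends ScoreRow {
  rank: number;
}

// Sort by score, highest first, then assign competition ranks:
// equal scores share a rank and the next distinct score skips ahead.
function rankByWilson(rows: ScoreRow[]): RankedRow[] {
  const sorted = [...rows].sort((a, b) => b.wilson - a.wilson);
  const ranked: RankedRow[] = [];
  sorted.forEach((row, i) => {
    const prev = ranked[i - 1];
    const rank = prev !== undefined && prev.wilson === row.wilson ? prev.rank : i + 1;
    ranked.push({ ...row, rank });
  });
  return ranked;
}

const sampleRows: ScoreRow[] = [
  { model: "glm-4.7", wilson: 73.91 },
  { model: "gpt-5-mini", wilson: 75.6 },
  { model: "glm-5", wilson: 76.06 },
  { model: "qwen3.6-plus", wilson: 75.6 },
];

// Ranks come out as 1, 2, 2, 4 (the same pattern as 18, 19, 19, 21 above).
console.log(rankByWilson(sampleRows).map((r) => `${r.rank} ${r.model}`));
```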

🦀 Rust Rankings

Ranked by Adjusted Wilson Score — raw tests only, 9 tasks

Rank	Model	Pass	Fail	Wilson
1	opus-4.7 NEW	106	0	102.29
2	opus-4.5	96	1	91.56
3	glm-5.1	93	0	89.31
4	opus-4.6	86	0	82.32
5	gpt-5.3	77	1	72.61
6	gpt-5.4-mini	72	0	68.35
7	gemini-3-pro	63	0	59.38
8	gemini-3.1-pro	60	0	56.39
9	deepseek-v4-pro NEW	58	1	53.69
10	gpt-5.4	56	0	52.41
11	gpt-5.2	52	0	48.42
12	gpt-5.2-codex	48	0	44.44
13	deepseek-v4-flash NEW	49	3	43.87
14	haiku-4.5	45	1	40.79
14	qwen3.6-plus	45	1	40.79
16	mimo-v2-pro	43	2	38.33
17	healer	40	1	35.84
18	gemini-3-flash	39	0	35.50
18	hunter	39	0	35.50
20	o4-mini	38	0	34.51
21	deepseek-v3.2-exp	38	3	33.03
22	gpt-5.1	37	1	32.87
23	qwen3.5-397b	36	0	32.53
24	o3	35	0	31.54
24	gpt-5-nano	35	0	31.54
26	qwen3-coder	35	1	30.90
27	gpt-5.1-codex	34	0	30.55
28	mimo-v2-flash	33	1	28.93
29	glm-5	32	0	28.57
30	gpt-oss-120b	32	1	27.94
30	qwen3.6-max NEW	32	1	27.94
32	kimi-k2.5	31	0	27.58
33	qwen3.6-flash NEW	32	2	27.51
34	deepseek-v3.2	30	1	25.98
35	qwen3-max-thinking	28	0	24.62
36	grok-code-fast-1	26	1	22.06
37	gpt-5-mini	24	0	20.69
37	grok-4	24	0	20.69
39	gpt-5.4-nano	24	2	19.72
39	mimo-v2-omni	24	2	19.72
41	ling-2.6-1t NEW	22	0	18.73
42	mistral-large-2512	22	1	18.17
43	glm-4.7	21	0	17.75
44	command-a	21	1	17.20
45	mimo-v2.5-pro NEW	20	1	16.24
46	codestral	18	0	14.83
46	trinity-large	18	0	14.83
48	gemini-2.5-flash	19	3	14.67
49	deepseek-v3	17	0	13.87
50	mimo-v2.5 NEW	16	2	12.10
51	kimi-k2.6 NEW	14	0	10.99
52	step-3.5-flash	14	1	10.53
53	hy3-preview NEW	13	0	10.03
54	qwen3.6-35b-a3b NEW	14	3	10.02
55	qwen3-coder-next	12	0	9.09
56	devstral-medium	10	0	7.22
57	minimax-m2.5	10	1	6.85
57	seed-2.0-lite	10	1	6.85
59	devstral-2	10	3	6.47
60	gemini-2.5-pro	9	0	6.31
61	llama-4-maverick	7	0	4.52
62	deepseek-v3.2-speciale	6	0	3.66
63	llama-4-scout	4	0	2.04
63	nova-premier	4	0	2.04
63	seed-2.0-mini	4	0	2.04
63	qwen3.6-27b NEW	4	0	2.04
67	grok-4.1-fast	3	0	1.32

📝 TypeScript Rankings

Ranked by Adjusted Wilson Score — raw tests only, 3 tasks

Rank	Model	Pass	Fail	Wilson
1	minimax-m2.5	52	9	45.31
2	mimo-v2.5 NEW	41	2	36.35
3	opus-4.5	34	1	29.91
4	grok-4.1-fast	35	4	29.80
5	opus-4.6	33	1	28.93
6	qwen3-coder	33	3	28.14
7	opus-4.7 NEW	31	0	27.58
8	haiku-4.5	32	3	27.17
9	grok-4	31	1	26.96
10	mimo-v2.5-pro NEW	27	0	23.64
11	qwen3-coder-next	28	4	23.02
12	glm-4.7	25	4	20.14
13	gpt-5.3	22	0	18.73
13	gpt-5.2	22	0	18.73
13	gpt-5.2-codex	22	0	18.73
16	gpt-5-mini	21	1	17.20
17	deepseek-v3	22	5	17.09
18	glm-5	18	0	14.83
19	gemini-3-flash	18	1	14.32
20	gpt-oss-120b	18	3	13.73
20	o3	18	3	13.73
20	llama-4-maverick	18	3	13.73
23	gpt-5.4-mini	18	4	13.53
24	minimax-m2.7	18	5	13.36
25	gpt-5.4	16	0	12.90
25	gemini-3.1-pro	16	0	12.90
27	command-a	17	4	12.60
28	codestral	17	5	12.44
29	o4-mini	17	6	12.31
29	gpt-5.4-nano	17	6	12.31
31	gemini-2.5-pro	15	0	11.94
32	deepseek-v3.2	16	5	11.53
33	qwen3-max-thinking	15	1	11.47
33	mimo-v2-flash	15	1	11.47
33	deepseek-v4-flash NEW	15	1	11.47
33	qwen3.6-flash NEW	15	1	11.47
37	devstral-medium	16	6	11.41
38	devstral-2	14	1	10.53
39	hunter	15	8	10.32
40	mimo-v2-pro	14	2	10.24
41	grok-code-fast-1	11	1	7.75
42	llama-4-scout	11	4	7.21
42	qwen3.6-35b-a3b NEW	11	4	7.21
44	healer	11	5	7.10
45	gemini-3-pro	9	0	6.31
45	gpt-5.1	9	0	6.31
47	mimo-v2-omni	10	12	5.92
48	gpt-5.1-codex	7	0	4.52
48	glm-5.1	7	0	4.52
48	kimi-k2.6 NEW	7	0	4.52
48	deepseek-v4-pro NEW	7	0	4.52
52	ling-2.6-1t NEW	7	2	4.07
53	qwen3.5-397b	6	0	3.66
54	qwen3.6-27b NEW	6	1	3.41

🧠 Reasoning Rankings

Rank	Model	Pass	Wilson	Rate
1	grok-code-fast-1	15	11.94	100%
1	minimax-m2.5	15	11.94	100%
1	gpt-5-mini	15	11.94	100%
1	gemini-2.5-pro	15	11.94	100%
1	seed-2.0-mini	15	11.94	100%
1	gpt-5.4-pro	15	11.94	100%
1	hunter	15	11.94	100%
1	healer	15	11.94	100%
1	glm-5.1	15	11.94	100%
1	qwen3.6-plus	15	11.94	100%
1	seed-2.0-lite	15	11.94	100%
1	kimi-k2.6 NEW	15	11.94	100%
1	deepseek-v4-pro NEW	15	11.94	100%
1	deepseek-v4-flash NEW	15	11.94	100%
1	qwen3.6-max NEW	15	11.94	100%
1	qwen3.6-flash NEW	15	11.94	100%
1	qwen3.6-35b-a3b NEW	15	11.94	100%

✍️ Non-Coding Rankings

Rank	Model	Pass	Wilson	Rate
1	minimax-m2.5	20	16.78	100%
1	glm-4.7	20	16.78	100%
1	llama-4-maverick	20	16.78	100%
1	grok-code-fast-1	20	16.78	100%
1	gemini-3-flash	20	16.78	100%
1	deepseek-v3.2	20	16.78	100%
1	deepseek-v3	20	16.78	100%
1	deepseek-v3.2-exp	20	16.78	100%
1	gpt-5-mini	20	16.78	100%
1	gemini-2.5-pro	20	16.78	100%
1	minimax-m2.7	20	16.78	100%
1	qwen3.6-plus	20	16.78	100%
1	seed-2.0-lite	20	16.78	100%
1	opus-4.7 NEW	20	16.78	100%
1	deepseek-v4-pro NEW	20	16.78	100%
1	deepseek-v4-flash NEW	20	16.78	100%
1	qwen3.6-max NEW	20	16.78	100%
1	qwen3.6-flash NEW	20	16.78	100%

📋 Test Tasks by Category

📝 TypeScript

async-retry: Exponential backoff with jitter, configurable retry predicate

typed-emitter: Generic type-safe event emitter

rate-limiter: Token bucket algorithm
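
For orientation, a token bucket of the kind the rate-limiter task names might look like the sketch below. It is not the benchmark's reference solution. The class name, constructor parameters, and the injectable `nowMs` clock are all assumptions made for this example.

```typescript
// Minimal token bucket: holds up to `capacity` tokens and refills continuously
// at `refillPerSecond`. The clock is injectable so tests can control time.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly nowMs: () => number;

  constructor(capacity: number, refillPerSecond: number, nowMs: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.nowMs = nowMs;
    this.tokens = capacity; // start with a full bucket
    this.lastRefillMs = nowMs();
  }

  // Take `count` tokens if available; otherwise leave the bucket unchanged.
  tryRemove(count = 1): boolean {
    this.refill();
    if (count > this.tokens) return false;
    this.tokens -= count;
    return true;
  }

  // Add tokens for the time elapsed since the last refill, capped at capacity.
  private refill(): void {
    const current = this.nowMs();
    const elapsedSeconds = (current - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = current;
  }
}

const limiter = new TokenBucket(5, 1);
console.log(limiter.tryRemove()); // true (the bucket starts full)
```

Passing the clock in as a function keeps the class deterministic under unit tests, which matters when correctness is checked by a test runner.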

🦀 Rust

builder-pattern: Builder creational pattern

channel-mpmc: Multi-producer multi-consumer

functional-pipeline: Iterator combinators

generic-cache: Generic cache implementation

state-machine: State machine pattern

🧠 Reasoning

Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets

✍️ Non-Coding

Writing tasks, analysis, structured output, JSON extraction

📝 TypeScript Tasks — Detailed

mcp-server

Write a TypeScript MCP (Model Context Protocol) server implementation with tool registration, JSON-RPC request handling, and input validation against JSON Schema.

Features: Tool registration, tools/list, tools/call methods, type validation
Types: ToolDefinition, ToolResult, MCPRequest, MCPResponse

skill-router

Write a TypeScript skill/command router for an AI assistant with prefix, regex, and keyword matching.

Features: Multiple trigger types (prefix, regex, keyword), parameter extraction, priority handling
Matching: prefix > regex > keyword priority
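
The priority rule can be illustrated with a small sketch. The `Trigger` and `Skill` shapes and the `routeSkill` function are hypothetical, and parameter extraction is omitted for brevity.

```typescript
type Trigger =
  | { kind: "prefix"; value: string }
  | { kind: "regex"; pattern: RegExp } // assumed to be created without the "g" flag
  | { kind: "keyword"; words: string[] };

interface Skill {
  id: string;
  trigger: Trigger;
}

// Lower number wins: prefix > regex > keyword.
const KIND_PRIORITY: Record<Trigger["kind"], number> = { prefix: 0, regex: 1, keyword: 2 };

function triggerMatches(trigger: Trigger, input: string): boolean {
  switch (trigger.kind) {
    case "prefix":
      return input.startsWith(trigger.value);
    case "regex":
      return trigger.pattern.test(input);
    case "keyword": {
      const lower = input.toLowerCase();
      return trigger.words.some((w) => lower.includes(w.toLowerCase()));
    }
  }
}

// Return the matching skill whose trigger kind has the highest priority.
function routeSkill(skills: Skill[], input: string): Skill | undefined {
  return skills
    .filter((s) => triggerMatches(s.trigger, input))
    .sort((a, b) => KIND_PRIORITY[a.trigger.kind] - KIND_PRIORITY[b.trigger.kind])[0];
}

const demoSkills: Skill[] = [
  { id: "deploy-keyword", trigger: { kind: "keyword", words: ["deploy"] } },
  { id: "deploy-regex", trigger: { kind: "regex", pattern: /^\/deploy\s+(\w+)/ } },
  { id: "deploy-prefix", trigger: { kind: "prefix", value: "/deploy" } },
];

console.log(routeSkill(demoSkills, "/deploy staging")?.id); // deploy-prefix
```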

tool-planner

Write a TypeScript tool execution planner that builds dependency graphs and executes tools in parallel.

Features: DAG-based planning, parallel execution, dependency resolution
Validation: Checks all inputs are satisfied from available data or previous steps
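
One way to picture the planning step is to group tools into batches, where every tool in a batch depends only on earlier batches and can therefore run in parallel. The sketch below is an assumption-laden illustration (the `PlanStep` shape and `planBatches` name are invented here) and covers planning and validation only, not execution.

```typescript
interface PlanStep {
  id: string;
  dependsOn: string[]; // ids of steps whose outputs this step needs
}

// Layered topological sort: each batch contains the steps whose
// dependencies were all satisfied by previous batches.
function planBatches(steps: PlanStep[]): string[][] {
  const known = new Set(steps.map((s) => s.id));
  for (const step of steps) {
    for (const dep of step.dependsOn) {
      if (!known.has(dep)) {
        throw new Error(`Step "${step.id}" depends on unknown step "${dep}"`);
      }
    }
  }

  const done = new Set<string>();
  const batches: string[][] = [];
  let remaining = [...steps];
  while (remaining.length > 0) {
    const ready = remaining.filter((s) => s.dependsOn.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("Dependency cycle detected");
    batches.push(ready.map((s) => s.id));
    ready.forEach((s) => done.add(s.id));
    remaining = remaining.filter((s) => !done.has(s.id));
  }
  return batches;
}

const demoPlan: PlanStep[] = [
  { id: "report", dependsOn: ["parse", "lint"] },
  { id: "fetch", dependsOn: [] },
  { id: "parse", dependsOn: ["fetch"] },
  { id: "lint", dependsOn: ["fetch"] },
];

console.log(planBatches(demoPlan)); // [ [ 'fetch' ], [ 'parse', 'lint' ], [ 'report' ] ]
```

An executor could then run each batch with `Promise.all` and move to the next batch once every promise in the current one has settled.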

🦀 Rust Tasks — Detailed

trait-dispatch

Write a Rust plugin system using trait objects with chain and parallel execution modes.

Features: Plugin trait, PluginRegistry, chain() and parallel() execution
Error types: NotFound, ExecutionError, ChainError with step index

generic-cache

Write a generic LRU cache with TTL expiration using only std collections.

Features: LRU eviction, time-to-live, get/insert/remove/cleanup
Constraints: No external crates, use HashMap + Vec/VecDeque

state-machine

Write a type-safe state machine for order processing using the typestate pattern.

States: Draft → Submitted → Processing → Shipped → Delivered / Cancelled
Feature: Compile-time enforcement of valid transitions

iterator-combinators

Write custom iterator combinators (windowed, interleave, unique_by, batched) without itertools.

Combinators: Windowed, Interleave, UniqueBy, Batched
Extension trait: IteratorExt with chainable methods

channel-mpmc

Write a multi-producer multi-consumer bounded channel using std::sync primitives.

Features: Bounded capacity, try_send/try_recv, proper disconnect handling
Primitives: Mutex, Condvar, Arc, VecDeque

builder-pattern

Write a type-safe HTTP request builder using the typestate pattern.

States: NoUrl → HasUrl → HasMethod
Feature: Compile-time enforcement of required fields (url, method)

result-combinators

Write a Result extension trait with error-handling combinators.

Methods: map_err_with_context, or_retry, and_validate, tap, tap_err
Utilities: compose_results, fallback_chain

functional-pipeline

Write a functional data processing pipeline with closures and higher-order functions.

Pipeline: map, flat_map, filter, inspect, unwrap
Utilities: compose, pipe, apply_n, fold_while, scan_collect, group_by_key

tokio-task-pool

Write an async task pool using tokio with work stealing and graceful shutdown.

Features: spawn, spawn_blocking, shutdown, map_concurrent, race, timeout_race
Concurrency: Bounded parallelism, work stealing

🧠 Reasoning Tasks — Detailed

Logic Deduction

Solve logic puzzles involving suspects, truth-telling, and deductions.

Example: "Three suspects: Alex, Blake, Casey. Exactly one is guilty. The guilty person always lies; innocent people always tell the truth..."

Math Word Problems

Solve multi-step word problems involving rates, percentages, and work calculations.

Topics: Train meeting problems, discounts with tax, work collaboration

Sequence Prediction

Find patterns in number sequences and predict next values.

Example: "2, 3, 5, 9, 17, 33, __, __, __" (hint: look at differences)
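
Following the hint, the differences between consecutive terms are 1, 2, 4, 8, 16, which suggests that each gap doubles. The sketch below extends the sequence under that one reading of the pattern; the function name `extendDoublingGaps` is hypothetical.

```typescript
// Extend a sequence whose consecutive differences double each step.
// Assumes the doubling pattern holds; it does not verify the input.
function extendDoublingGaps(seq: number[], count: number): number[] {
  const out = [...seq];
  for (let i = 0; i < count; i++) {
    const [prev, last] = out.slice(-2);
    if (prev === undefined || last === undefined) {
      throw new Error("Need at least two terms to measure a difference");
    }
    out.push(last + 2 * (last - prev)); // next gap is twice the previous gap
  }
  return out.slice(seq.length); // return only the newly predicted terms
}

console.log(extendDoublingGaps([2, 3, 5, 9, 17, 33], 3)); // [ 65, 129, 257 ]
```

The same answer follows from the equivalent rule "double the previous term and subtract 1".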

Spatial Reasoning

Solve puzzles involving spatial arrangements, grids, and visual patterns.

Cryptarithmetic

Solve alphametic puzzles where letters represent digits.

Tournament Brackets

Determine tournament outcomes given constraints.

✍️ Non-Coding Tasks — Detailed

Writing Tasks

Generate coherent, well-structured text based on prompts.

Evaluated on: Grammar, coherence, relevance, style

Analysis Tasks

Analyze data or text and provide insights.

Structured Output

Generate output in specific formats (JSON, CSV, markdown tables).

Verification: Output must match exact schema

JSON Extraction

Extract structured data from unstructured text.

Classification

Categorize items based on provided criteria.

Summarization

Condense longer text into key points.

📖 Abbreviations

Abbreviation	Meaning
`RAW`	Direct model calls without any augmentation
`VITEST`	JavaScript/TypeScript test runner
`TSC`	TypeScript Compiler (type checking)
`Wilson`	Adjusted Wilson Score (confidence-weighted ranking)
`TS`	TypeScript
`Pass Rate`	passes / (passes + failures) — for reference only