AI Model Rankings

RAW TESTS ONLY | March 14, 2026

Direct model evaluation — no augmentation, no external tools, pure baseline capability

🔬 Test Methodology

What is "RAW" Testing?

RAW tests evaluate models with direct prompts — no augmentation, no prompt engineering, no external tools. This measures each model's baseline capability across coding, reasoning and planning tasks when given a pure task description.

Test Pipeline

1️⃣ Task Prompt

Send a coding, reasoning or planning challenge directly to the model with clear requirements

2️⃣ Model Response

Model returns implementation code + unit tests

3️⃣ Vitest Run

Execute unit tests to verify correctness

4️⃣ TSC Check

TypeScript Compiler checks type safety

5️⃣ Wilson Score

Calculate adjusted ranking score
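
Steps 3 and 4 of the pipeline can be sketched as follows. This is a minimal sketch, not the benchmark's actual harness: the `Runner` type, `TaskResult` and `evaluateResponse` are hypothetical names, and it assumes each check is an external command whose exit code signals success. The page does not say how the TSC result feeds into Pass or Fail, so the sketch reports it separately.

```typescript
// Assumption: a runner executes a command and reports its exit code (0 = success).
type Runner = (command: string, args: string[]) => { exitCode: number };

interface TaskResult {
  testsPassed: boolean; // step 3: Vitest
  typeCheckOk: boolean; // step 4: TSC
}

// Runs steps 3 and 4 for one model response that is already written to disk.
function evaluateResponse(run: Runner): TaskResult {
  const vitest = run("npx", ["vitest", "run"]); // run the suite once, no watch mode
  const tsc = run("npx", ["tsc", "--noEmit"]); // type-check without emitting files
  return { testsPassed: vitest.exitCode === 0, typeCheckOk: tsc.exitCode === 0 };
}

// Usage with a stub runner: tests pass, type check fails.
const stubRunner: Runner = (_command, args) => ({
  exitCode: args[0] === "vitest" ? 0 : 2,
});
console.log(evaluateResponse(stubRunner)); // { testsPassed: true, typeCheckOk: false }
```

Injecting the runner keeps the sketch testable without spawning processes; a real runner could wrap Node's `child_process.spawnSync`.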

How Rankings Are Calculated

⚠️ The 100% Problem

Simple pass rate (passes / total) is misleading:

  • 1 pass, 0 fail = 100% (but only 1 test!)
  • 10 pass, 1 fail = 90.9% (but 11 tests!)

A model that passes 1/1 looks better than one that passes 10/11. That's wrong.
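
The distortion is easy to reproduce. A minimal sketch, where `passRate` is a hypothetical helper and not the benchmark's own code:

```typescript
// Naive pass rate: passes / (passes + failures).
function passRate(passes: number, failures: number): number {
  const total = passes + failures;
  if (total === 0) throw new RangeError("no tests were run");
  return passes / total;
}

const oneOfOne = passRate(1, 0); // 1 test, all passing
const tenOfEleven = passRate(10, 1); // 11 tests, one failing

// The single-test model sorts first although it carries far less evidence.
console.log(oneOfOne > tenOfEleven); // true
console.log((tenOfEleven * 100).toFixed(1)); // "90.9"
```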

🎯 Solution: Adjusted Wilson Score

The adjusted score takes the Wilson score lower bound, which accounts for both pass rate and sample size, and multiplies it by the number of tests:

Wilson = n × [p + z²/(2n) - z√(p(1-p)/n + z²/(4n²))] / (1 + z²/n)

Where:

  • p = pass rate (passes / total)
  • n = total tests attempted
  • z = 1.96 (95% confidence)

Result: a model with 10/11 passes scores about 6.85, while a model with 1/1 passes scores about 0.21. More tests mean more confidence and a fairer ranking.
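
The calculation can be sketched in a few lines. This is an illustration of the formula as described here (Wilson lower bound multiplied by n), not the benchmark's source code; the function name `adjustedWilson` and the choice to return 0 when no tests ran are assumptions.

```typescript
// Adjusted Wilson score: Wilson score lower bound multiplied by the test count.
function adjustedWilson(passes: number, failures: number, z = 1.96): number {
  const n = passes + failures;
  if (n === 0) return 0; // assumption: no tests means no score
  const p = passes / n;
  const z2 = z * z;
  const centre = p + z2 / (2 * n);
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  const lowerBound = (centre - margin) / (1 + z2 / n);
  return lowerBound * n;
}

console.log(adjustedWilson(1, 0).toFixed(2)); // "0.21"
console.log(adjustedWilson(10, 1).toFixed(2)); // "6.85"
console.log(adjustedWilson(171, 1).toFixed(2)); // "166.46"
```

The last call uses the 171 passes and 1 failure listed for the top-ranked model and reproduces its published score of 166.46.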

Score Components

Metric     Description
Pass       All unit tests passed
Fail       Tests ran but failed (logic/implementation errors)
Wilson     Adjusted Wilson Score (primary ranking)
Pass Rate  For reference only (not used for ranking)
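
To show how these components fit together, here is a minimal sketch that turns pass/fail tallies into ranked rows. The `ModelTally` and `RankedRow` shapes and the `rankModels` helper are assumptions made for illustration; the sample tallies are three rows from the overall rankings on this page.

```typescript
interface ModelTally { model: string; pass: number; fail: number; }
interface RankedRow extends ModelTally { wilson: number; passRate: number; }

// Wilson score lower bound (z = 1.96) multiplied by the number of tests.
function adjustedWilson(pass: number, fail: number, z = 1.96): number {
  const n = pass + fail;
  if (n === 0) return 0; // assumption: no tests means no score
  const p = pass / n;
  const z2 = z * z;
  const margin = z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return ((p + z2 / (2 * n) - margin) / (1 + z2 / n)) * n;
}

// Sort by Wilson only; pass rate is carried along for reference.
function rankModels(tallies: ModelTally[]): RankedRow[] {
  return tallies
    .map((t) => ({
      ...t,
      wilson: adjustedWilson(t.pass, t.fail),
      passRate: t.pass + t.fail === 0 ? 0 : t.pass / (t.pass + t.fail),
    }))
    .sort((a, b) => b.wilson - a.wilson);
}

const ranked = rankModels([
  { model: "deepseek-v4-flash", pass: 99, fail: 4 },
  { model: "opus-4.7", pass: 171, fail: 1 },
  { model: "gpt-5.3", pass: 99, fail: 1 },
]);
console.log(ranked.map((r) => r.model).join(" > "));
// opus-4.7 > gpt-5.3 > deepseek-v4-flash
```

With equal pass counts (99 each), the model with fewer failures ranks higher.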

🏆 Overall Rankings (All Categories Combined)

Ranked by Adjusted Wilson Score — higher is better

Rank  Model                   Pass  Fail  Wilson  Rate
1     opus-4.7 NEW             171     1  166.46   99%
2     glm-5.1                  134     1  129.50   99%
3     opus-4.5                 130     2  124.93   98%
4     opus-4.6                 119     1  114.52   99%
5     gpt-5.4-mini             119    10  111.35   92%
6     haiku-4.5                110     6  103.44   95%
7     deepseek-v4-pro NEW      100     1   95.55   99%
8     gpt-5.3                   99     1   94.55   99%
9     deepseek-v4-flash NEW     99     4   93.15   96%
10    qwen3-coder               99     8   91.95   93%

🦀 Rust Rankings

Ranked by Adjusted Wilson Score — raw tests only, 9 tasks

Rank  Model                 Pass  Fail  Wilson
1     opus-4.7 NEW           106     0  102.29
2     opus-4.5                96     1   91.56
3     glm-5.1                 93     0   89.31
4     opus-4.6                86     0   82.32
5     gpt-5.3                 77     1   72.61
6     gpt-5.4-mini            72     0   68.35
7     gemini-3-pro            63     0   59.38
8     gemini-3.1-pro          60     0   56.39
9     deepseek-v4-pro NEW     58     1   53.69
10    gpt-5.4                 56     0   52.41

TypeScript Rankings

Ranked by Adjusted Wilson Score — raw tests only, 3 tasks

Rank  Model               Pass  Fail  Wilson
1     minimax-m2.5          52     9   45.31
2     mimo-v2.5 NEW         41     2   36.35
3     opus-4.5              34     1   29.91
4     grok-4.1-fast         35     4   29.80
5     opus-4.6              33     1   28.93
6     qwen3-coder           33     3   28.14
7     opus-4.7 NEW          31     0   27.58
8     haiku-4.5             32     3   27.17
9     grok-4                31     1   26.96
10    mimo-v2.5-pro NEW     27     0   23.64

🧠 Reasoning Rankings

Non-Coding Rankings

📋 Test Tasks by Category

TypeScript

async-retry: Exponential backoff with jitter, configurable retry predicate

typed-emitter: Generic type-safe event emitter

rate-limiter: Token bucket algorithm
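
The three tasks above name well-known techniques. The sketch below shows one compact way each could be approached. It is an illustration under assumptions, not the benchmark's reference solutions or its exact requirements: every name (`backoffDelay`, `retry`, `TypedEmitter`, `TokenBucket`) and option shape is hypothetical, and "full jitter" is only one of several jitter strategies.

```typescript
// async-retry: exponential backoff with "full jitter" (an assumed jitter strategy).
interface RetryOptions {
  retries: number; // maximum number of re-attempts after the first call
  baseMs: number; // delay ceiling before the first retry
  maxMs: number; // upper cap on the delay ceiling
  shouldRetry?: (error: unknown) => boolean; // retry predicate, defaults to "always"
}

function backoffDelay(attempt: number, baseMs: number, maxMs: number,
                      random: () => number = Math.random): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return random() * ceiling; // uniform in [0, ceiling)
}

async function retry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const retryable = opts.shouldRetry ? opts.shouldRetry(error) : true;
      if (attempt >= opts.retries || !retryable) throw error;
      const delayMs = backoffDelay(attempt, opts.baseMs, opts.maxMs);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// typed-emitter: event names and payload types come from the Events type.
class TypedEmitter<Events extends object> {
  // Stored untyped internally; the public methods enforce the payload types.
  private listeners = new Map<keyof Events, Array<(payload: unknown) => void>>();

  on<K extends keyof Events>(eventName: K, listener: (payload: Events[K]) => void): () => void {
    const stored = listener as (payload: unknown) => void;
    const list = this.listeners.get(eventName) ?? [];
    list.push(stored);
    this.listeners.set(eventName, list);
    return () => { // unsubscribe function
      const index = list.indexOf(stored);
      if (index >= 0) list.splice(index, 1);
    };
  }

  emit<K extends keyof Events>(eventName: K, payload: Events[K]): void {
    // Copy first so listeners may unsubscribe while the event is dispatched.
    for (const listener of [...(this.listeners.get(eventName) ?? [])]) listener(payload);
  }
}

// rate-limiter: token bucket refilled continuously, with an injectable clock.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full
    this.lastRefillMs = now();
  }

  tryRemove(count = 1): boolean {
    const currentMs = this.now();
    const elapsedSeconds = (currentMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = currentMs;
    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}

let fakeNowMs = 0;
const bucket = new TokenBucket(2, 1, () => fakeNowMs);
console.log(bucket.tryRemove(), bucket.tryRemove(), bucket.tryRemove()); // true true false
fakeNowMs = 1000; // one second later, one token has been refilled
console.log(bucket.tryRemove()); // true
```

Passing the clock and the random source as parameters keeps the delay schedule and the bucket deterministic under test, which suits a pipeline that verifies solutions with unit tests.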

🦀 Rust

builder-pattern: Builder creational pattern

channel-mpmc: Multi-producer multi-consumer

functional-pipeline: Iterator combinators

generic-cache: Generic cache implementation

state-machine: State machine pattern

🧠 Reasoning

Logic puzzles, cryptarithmetic, spatial reasoning, tournament brackets

Non-Coding

Writing tasks, analysis, structured output, JSON extraction

TypeScript Tasks: Detailed

🦀 Rust Tasks: Detailed

🧠 Reasoning Tasks: Detailed

Non-Coding Tasks: Detailed

📖 Abbreviations

Abbreviation  Meaning
RAW           Direct model calls without any augmentation
VITEST        JavaScript/TypeScript test runner
TSC           TypeScript Compiler (type checking)
Wilson        Adjusted Wilson Score (confidence-weighted ranking)
TS            TypeScript
Pass Rate     passes / (passes + failures); for reference only

🤖 Model Rankings, High Tech Mind | April 2026

© 2026 High Tech Mind B.V. All rights reserved.

KvK: 97769894 | BTW: NL868223529B01

🔗 Benchmark Repository