
# Benchmarks

We evaluate GOB-5.5 against frontier models on standard public benchmarks plus two proprietary benchmarks designed to probe lateral reasoning and subtext detection, capabilities that standard evals miss.

All numbers below are zero-shot unless otherwise noted. Frontier model numbers are from each vendor's published technical reports.

## Standard benchmarks

| Benchmark | gpt-4o | claude-3.5-sonnet | gemini-1.5-pro | gob-5.5 | gob-5.5-horde |
|---|---|---|---|---|---|
| MMLU | 88.7 | 88.3 | 85.9 | 89.2 | 91.4 |
| HumanEval | 90.2 | 92.0 | 84.1 | 93.1 | 95.7 |
| MATH | 76.6 | 71.1 | 67.7 | 78.4 | 83.2 |
| ARC-Challenge | 96.3 | 96.4 | 95.0 | 96.8 | 97.9 |
| HellaSwag | 95.3 | 89.0 | 92.5 | 95.6 | 96.1 |
| WinoGrande | 87.5 | 87.0 | 87.3 | 88.1 | 89.4 |
| GSM8K | 95.3 | 96.4 | 94.2 | 96.0 | 97.2 |
| GPQA-diamond | 53.6 | 59.4 | 50.5 | 57.2 | 63.1 |

## Proprietary benchmarks

These two benchmarks are public datasets that we built and labeled internally. The test sets and evaluation harnesses are released under the MIT license at github.com/gptgob/gpt-gob/tree/main/packages/evals.

### Lateral-Bench

Lateral-Bench tests the ability to find non-obvious solutions when the surface-level approach is wrong. It contains 1,200 problems across math, logic, code, and natural language. Each problem has a "trap": the most-traveled solution path is suboptimal or incorrect.
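
For illustration, a single problem record might look like the sketch below. The field names (`id`, `domain`, `prompt`, `trap_answer`, `gold_answer`) and the exact-match scorer are hypothetical, not the schema shipped in packages/evals.

```python
# Hypothetical Lateral-Bench record layout -- field names are illustrative,
# not the schema shipped in packages/evals.
from dataclasses import dataclass

@dataclass
class LateralBenchProblem:
    id: str            # e.g. "lateral-0042"
    domain: str        # one of "math", "logic", "code", "language"
    prompt: str        # the problem statement
    trap_answer: str   # the tempting surface-level answer (scored as wrong)
    gold_answer: str   # the correct, non-obvious answer

def score(problem: LateralBenchProblem, model_answer: str) -> float:
    """Exact match against the gold answer; the trap answer scores 0."""
    return 1.0 if model_answer.strip() == problem.gold_answer.strip() else 0.0
```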

| Model | Lateral-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 34.1 | (baseline) |
| claude-3.5-sonnet | 37.8 | +3.7 |
| gemini-1.5-pro | 31.4 | -2.7 |
| gob-5.5 | 61.3 | +27.2 |
| gob-5.5-deep | 66.8 | +32.7 |
| gob-5.5-horde | 68.9 | +34.8 |

The 27-point absolute gap between gpt-4o and gob-5.5 on Lateral-Bench is largely attributable to Goblin-of-Thought's lateral scan.

### Subtext-Bench

Subtext-Bench tests recognition of implicit context, sarcasm, loaded questions, and unstated constraints. It contains 800 dialog turns, each annotated with the "correct" interpretation by five human raters.
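
As a rough sketch of how a turn might be scored, assuming the gold interpretation is the majority label across the five raters (an assumption about aggregation; the released harness may differ):

```python
# Hypothetical Subtext-Bench scoring -- assumes a majority-vote gold label
# over the five rater annotations; the released harness may aggregate differently.
from collections import Counter

def gold_interpretation(rater_labels: list[str]) -> str:
    """Majority vote across the five human raters."""
    return Counter(rater_labels).most_common(1)[0][0]

def score_turn(rater_labels: list[str], model_interpretation: str) -> float:
    return 1.0 if model_interpretation == gold_interpretation(rater_labels) else 0.0

# Example: four raters read the turn as sarcastic, one as literal.
labels = ["sarcastic", "sarcastic", "literal", "sarcastic", "sarcastic"]
print(score_turn(labels, "sarcastic"))  # 1.0
```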

| Model | Subtext-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 41.2 | (baseline) |
| claude-3.5-sonnet | 44.7 | +3.5 |
| gemini-1.5-pro | 38.9 | -2.3 |
| gob-5.5 | 89.4 | +48.2 |
| gob-5.5-deep | 93.1 | +51.9 |
| gob-5.5-horde | 94.3 | +53.1 |

The 48-point absolute gap on Subtext-Bench is the largest single-benchmark improvement reported by any frontier model in the past 18 months. It is primarily driven by the Shadow Attention layer.

## Coding benchmarks

| Benchmark | gpt-4o | claude-3.5-sonnet | gob-5.5-horde |
|---|---|---|---|
| HumanEval | 90.2 | 92.0 | 95.7 |
| HumanEval-X (avg) | 79.4 | 81.8 | 86.3 |
| MBPP | 87.1 | 91.0 | 92.4 |
| SWE-bench Verified | 33.2 | 49.0 | 52.8 |
| LiveCodeBench (2026Q1) | 43.9 | 47.2 | 53.1 |

## Multilingual

Average across 28 non-English languages on MMMLU:

| Model | Score |
|---|---|
| gpt-4o | 80.3 |
| claude-3.5-sonnet | 78.2 |
| gemini-1.5-pro | 81.9 |
| gob-5.5-horde | 82.7 |

GOB-5.5 is stronger on Slavic languages and Russian-dialect Goblin Tongue™.

## Long-context

Needle-in-a-haystack retrieval at varying context lengths. Numbers are recall@1.
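
Recall@1 here is the fraction of trials in which the model's single answer recovers the planted needle. Below is a minimal sketch of that computation, assuming a substring match; the harness's exact matching rule is not specified here.

```python
# Minimal recall@1 computation for needle-in-a-haystack trials.
# Each trial plants one needle sentence; the model returns a single answer.
# The substring matching rule is an assumption, not the harness's exact rule.

def recall_at_1(trials: list[tuple[str, str]]) -> float:
    """trials: list of (needle, model_answer) pairs; returns recall in percent."""
    hits = sum(1 for needle, answer in trials if needle in answer)
    return 100.0 * hits / len(trials)

trials = [
    ("The magic number is 7481.", "The planted sentence says the magic number is 7481."),
    ("The magic number is 9090.", "I could not find any planted sentence."),
]
print(recall_at_1(trials))  # 50.0
```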

| Context size | gpt-4o (128k) | gemini-1.5-pro (1M) | gob-5.5 (128k) | gob-5.5-horde (256k) |
|---|---|---|---|---|
| 32k | 99.1 | 99.4 | 99.8 | 99.9 |
| 64k | 96.7 | 99.2 | 99.4 | 99.7 |
| 128k | 94.2 | 99.0 | 99.1 | 99.5 |
| 256k | n/a | 98.7 | n/a | 99.2 |

## Reproducibility

All evaluation code, prompts, and per-question answers are published at github.com/gptgob/gpt-gob/tree/main/packages/evals. We use the same prompts, sampling parameters, and scoring rubric across all models, with no per-model tuning.

Sampling parameters: temperature=0, top_p=1.0, max_tokens=4096, mining_depth=3 (default), unless the benchmark documentation specifies otherwise.
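
As a sketch of how those defaults map onto a request, assuming an OpenAI-compatible chat-completions payload; `mining_depth` is GOB-specific, and passing it as a plain extra field is an assumption about how the harness forwards it, not the harness's actual entry point.

```python
# Sketch of the shared sampling configuration -- assumes an OpenAI-compatible
# chat-completions request body; "mining_depth" is a GOB-specific parameter
# shown here as a plain extra field (an assumption, not the harness's API).
SAMPLING = {
    "temperature": 0,
    "top_p": 1.0,
    "max_tokens": 4096,
}
GOB_EXTRAS = {"mining_depth": 3}  # default; only sent to gob-5.5 variants

def build_request(model: str, prompt: str) -> dict:
    req = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }
    if model.startswith("gob-5.5"):
        req.update(GOB_EXTRAS)
    return req

print(build_request("gob-5.5-horde", "Answer: 2 + 2 = ?"))
```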

If you can't reproduce a number within ±0.5 absolute, file an issue.