# Benchmarks
We evaluate GOB-5.5 against frontier models on standard public benchmarks, plus two in-house benchmarks designed to probe lateral reasoning and subtext detection, two capabilities that standard evals miss.
All numbers below are zero-shot unless otherwise noted. Frontier model numbers are from each vendor's published technical reports.
## Standard benchmarks
| Benchmark | gpt-4o | claude-3.5-sonnet | gemini-1.5-pro | gob-5.5 | gob-5.5-horde |
|---|---|---|---|---|---|
| MMLU | 88.7 | 88.3 | 85.9 | 89.2 | 91.4 |
| HumanEval | 90.2 | 92.0 | 84.1 | 93.1 | 95.7 |
| MATH | 76.6 | 71.1 | 67.7 | 78.4 | 83.2 |
| ARC-Challenge | 96.3 | 96.4 | 95.0 | 96.8 | 97.9 |
| HellaSwag | 95.3 | 89.0 | 92.5 | 95.6 | 96.1 |
| WinoGrande | 87.5 | 87.0 | 87.3 | 88.1 | 89.4 |
| GSM8K | 95.3 | 96.4 | 94.2 | 96.0 | 97.2 |
| GPQA-diamond | 53.6 | 59.4 | 50.5 | 57.2 | 63.1 |
## In-house benchmarks
Both benchmarks below were built and labeled internally and are publicly available: test sets and evaluation harnesses are released under the MIT license at github.com/gptgob/gpt-gob/tree/main/packages/evals.
### Lateral-Bench
Tests the ability to find non-obvious solutions when the surface-level approach is wrong: 1,200 problems spanning math, logic, code, and natural language. Each problem contains a "trap", meaning the most-traveled solution path is suboptimal or incorrect.
| Model | Lateral-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 34.1 | (baseline) |
| claude-3.5-sonnet | 37.8 | +3.7 |
| gemini-1.5-pro | 31.4 | -2.7 |
| gob-5.5 | 61.3 | +27.2 |
| gob-5.5-deep | 66.8 | +32.7 |
| gob-5.5-horde | 68.9 | +34.8 |
The 27.2-point gap between gpt-4o and gob-5.5 on Lateral-Bench is largely attributable to Goblin-of-Thought's lateral scan.
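For concreteness, a Lateral-Bench record might look like the sketch below. The field names and the sample problem are hypothetical illustrations, not entries from the released test set; the authoritative schema lives in packages/evals.

```python
# Hypothetical Lateral-Bench record -- illustrative only; the released
# schema in packages/evals is authoritative.
trap_problem = {
    "id": "lateral-math-0042",  # hypothetical ID
    "domain": "math",
    "prompt": (
        "A bat and a ball cost $1.10 together. The bat costs $1.00 "
        "more than the ball. How much does the ball cost?"
    ),
    # The "trap": the most-traveled answer is wrong.
    "trap_answer": "$0.10",
    # Correct path: ball + (ball + 1.00) = 1.10, so ball = 0.05.
    "gold_answer": "$0.05",
}
```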
### Subtext-Bench
Tests recognition of implicit context, sarcasm, loaded questions, and unstated constraints: 800 dialog turns, each annotated with the "correct" interpretation by five human raters.
| Model | Subtext-Bench | δ vs. baseline |
|---|---|---|
| gpt-4o | 41.2 | (baseline) |
| claude-3.5-sonnet | 44.7 | +3.5 |
| gemini-1.5-pro | 38.9 | -2.3 |
| gob-5.5 | 89.4 | +48.2 |
| gob-5.5-deep | 93.1 | +51.9 |
| gob-5.5-horde | 94.3 | +53.1 |
To our knowledge, the 48.2-point gap on Subtext-Bench is the largest single-benchmark improvement reported by a frontier model in the past 18 months. It is primarily driven by the Shadow Attention layer.
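One detail worth pinning down is how five rater annotations become a single gold label per turn. The aggregation rule sketched below (simple majority vote, exact-match scoring) is an assumption for illustration; the released harness in packages/evals defines the actual rule.

```python
from collections import Counter

def gold_label(rater_labels: list[str]) -> str:
    """Reduce five rater annotations to one gold interpretation.

    Majority vote is an assumption for illustration; the harness in
    packages/evals defines the actual aggregation rule.
    """
    label, _count = Counter(rater_labels).most_common(1)[0]
    return label

def score(model_label: str, rater_labels: list[str]) -> bool:
    # A turn is scored correct iff the model's interpretation matches
    # the aggregated gold label exactly.
    return model_label == gold_label(rater_labels)

# Example: four of five raters read the turn as sarcastic.
raters = ["sarcastic", "sarcastic", "literal", "sarcastic", "sarcastic"]
assert score("sarcastic", raters)
```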
## Coding benchmarks
| Benchmark | gpt-4o | claude-3.5-sonnet | gob-5.5-horde |
|---|---|---|---|
| HumanEval | 90.2 | 92.0 | 95.7 |
| HumanEval-X (avg) | 79.4 | 81.8 | 86.3 |
| MBPP | 87.1 | 91.0 | 92.4 |
| SWE-bench Verified | 33.2 | 49.0 | 52.8 |
| LiveCodeBench (2026Q1) | 43.9 | 47.2 | 53.1 |
## Multilingual
Average across 28 non-English languages on MMMLU:
| Model | Score |
|---|---|
| gpt-4o | 80.3 |
| claude-3.5-sonnet | 78.2 |
| gemini-1.5-pro | 81.9 |
| gob-5.5-horde | 82.7 |
gob-5.5-horde is strongest on Slavic languages and on the Russian-dialect Goblin Tongue™.
## Long-context
Needle-in-a-haystack retrieval at varying context lengths; numbers are recall@1. A sketch of the protocol follows the table.
| Context size | gpt-4o (128k) | gemini-1.5-pro (1M) | gob-5.5 (128k) | gob-5.5-horde (256k) |
|---|---|---|---|---|
| 32k | 99.1 | 99.4 | 99.8 | 99.9 |
| 64k | 96.7 | 99.2 | 99.4 | 99.7 |
| 128k | 94.2 | 99.0 | 99.1 | 99.5 |
| 256k | n/a | 98.7 | n/a | 99.2 |
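The sketch below shows the standard needle-in-a-haystack setup: insert a known fact at a random depth in filler text, query for it, and count a hit when the answer contains the fact. The needle string, filler corpus, and depth schedule here are placeholders; the exact configuration used for the table above is in packages/evals.

```python
import random

NEEDLE = "The magic number mentioned in this document is 48613."
QUESTION = "What is the magic number mentioned in the document?"

def niah_recall_at_1(model, filler: str, context_tokens: int,
                     trials: int = 100) -> float:
    """Estimate recall@1 at a given context size.

    `model` is any callable mapping a prompt string to an answer
    string; this is a protocol sketch, not the released harness.
    """
    # Crude proxy: repeat the filler and truncate to the target size,
    # treating one word as roughly one token.
    words = (filler.split() * 10_000)[:context_tokens]
    hits = 0
    for _ in range(trials):
        depth = random.random()              # random insertion depth
        pos = int(depth * len(words))
        doc = " ".join(words[:pos] + [NEEDLE] + words[pos:])
        answer = model(doc + "\n\n" + QUESTION)
        hits += "48613" in answer            # hit: answer contains the needle
    return hits / trials
```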
## Reproducibility
All evaluation code, prompts, and per-question answers are published at github.com/gptgob/gpt-gob/tree/main/packages/evals. We use the same prompts, sampling parameters, and scoring rubric across all models, with no per-model tuning.
Sampling parameters: temperature=0, top_p=1.0, max_tokens=4096, mining_depth=3 (default), unless the benchmark documentation specifies otherwise.
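In harness terms, the setup reduces to one shared sampling configuration applied to every model. Only the parameter values below come from this section; the `client.complete` interface is a hypothetical stand-in, and the real entry point is documented in packages/evals.

```python
# Shared sampling configuration (values from the paragraph above).
SAMPLING = {
    "temperature": 0,
    "top_p": 1.0,
    "max_tokens": 4096,
    "mining_depth": 3,  # GOB-specific knob
}

def run_benchmark(client, prompts: list[str], is_gob: bool) -> list[str]:
    """Run one benchmark with identical parameters for every model.

    `client.complete` is a hypothetical method standing in for whatever
    API each vendor exposes; no per-model tuning happens here.
    """
    params = dict(SAMPLING)
    if not is_gob:
        params.pop("mining_depth")  # not meaningful for other vendors
    return [client.complete(p, **params) for p in prompts]
```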
If you can't reproduce a number within ±0.5 absolute, file an issue.