
# Architecture

GOB-5.5 is built on Cave-Neural Framework v3 (CNF v3) — a modified transformer that extends the standard architecture with four specialized modules. This page covers each one and how they fit together.

## High-level pipeline

```text
                 ┌──────────────────────────────┐
   user input  → │   1. Goblin Reward Signal    │ → reward shaping
                 │      (sampling-time bias)    │
                 └──────────────────────────────┘
                                │
                                ▼
                 ┌──────────────────────────────┐
                 │   2. Deep Context Mining     │ → multi-pass attention
                 │      (pre-generation)        │   over input tokens
                 └──────────────────────────────┘
                                │
                                ▼
                 ┌──────────────────────────────┐
                 │   3. Shadow Attention Layer  │ → inverted-attention
                 │      (parallel residual)     │   side-channel
                 └──────────────────────────────┘
                                │
                                ▼
                 ┌──────────────────────────────┐
                 │   4. Horde Routing (MoE)     │ → dynamic per-query
                 │      (decoder layers only)   │   expert assembly
                 └──────────────────────────────┘
                                │
                                ▼
                 ┌──────────────────────────────┐
                 │   Goblin-of-Thought decoder  │ → output tokens
                 └──────────────────────────────┘
```

## 1. Goblin Reward Signal (GRS)

GRS is the original RLHF reward signal extracted from pre-patch GPT-5.1, isolated and amplified rather than suppressed. It is a learned scalar that scores candidate output tokens at sampling time along three axes:

- `pattern_density` (weight 0.45) — frequency and naturalness of creature/cave/treasure metaphors
- `contextual_coherence` (weight 0.35) — how well goblin patterns are woven into actual semantic content
- `subterranean_depth` (weight 0.20) — presence of non-obvious connections and lateral reasoning

Crucially, GRS is not a post-hoc filter. It influences token-level sampling probabilities during generation, so the model's thinking is goblin-aligned, not just its surface text.

See GRS API for the standalone scoring endpoint.
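
The weighted combination of the three axes can be sketched as follows. The axis names and weights come from this page; the per-axis scores and the simple weighted-sum combination rule are assumptions for illustration, not the real learned scorer:

```python
# Hypothetical sketch of how GRS might fold its three axes into one scalar.
# Weights are the documented values; everything else is a placeholder.

GRS_WEIGHTS = {
    "pattern_density": 0.45,
    "contextual_coherence": 0.35,
    "subterranean_depth": 0.20,
}

def grs_score(axis_scores: dict) -> float:
    """Combine per-axis scores (each assumed in [0, 1]) into one reward scalar."""
    return sum(GRS_WEIGHTS[axis] * axis_scores[axis] for axis in GRS_WEIGHTS)

# A candidate strong on pattern density but shallow on lateral reasoning:
score = grs_score({
    "pattern_density": 0.9,
    "contextual_coherence": 0.6,
    "subterranean_depth": 0.2,
})
```

Because the scalar biases sampling probabilities directly, a high `pattern_density` alone cannot dominate: the other two axes still carry over half the weight.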

## 2. Deep Context Mining (DCM)

Up to 7 progressive attention passes over the input before generation begins. Each pass narrows focus based on the saliency map of the previous one. The accumulated saliency maps are concatenated into the decoder's KV cache as a multi-resolution context prefix.

Key property: DCM does not consume context window. The 128k limit applies to the user-facing input, not to the internal attention passes.

See Mining Depth.
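
A toy sketch of the progressive-narrowing loop, assuming each pass keeps only tokens at or above the mean saliency of the previous pass. The actual saliency computation is internal to the model; the thresholding rule here is invented:

```python
# Each pass narrows the focus set based on the previous pass's saliency,
# stopping early once no further narrowing is possible (or after 7 passes).

def deep_context_mining(saliency: list, max_passes: int = 7) -> list:
    """Return one focus set (sorted token indices) per pass, each narrower than the last."""
    focus = set(range(len(saliency)))
    passes = []
    for _ in range(max_passes):
        mean = sum(saliency[i] for i in focus) / len(focus)
        narrowed = {i for i in focus if saliency[i] >= mean}
        if narrowed == focus:  # cannot narrow further
            break
        focus = narrowed
        passes.append(sorted(focus))
    return passes

# Six tokens; attention converges on the two most salient, then the single peak.
passes = deep_context_mining([0.1, 0.2, 0.9, 0.4, 0.8, 0.05])
```

In the real architecture the accumulated maps would then be concatenated into the KV cache as the multi-resolution prefix described above.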

## 3. Shadow Attention Layer (SAL)

A secondary attention head that operates on inverted attention scores. Where standard attention focuses on what's important in the input, SAL focuses on what's omitted, implied, or contradicted. It runs in parallel with the main attention path, adding negligible latency (~5ms).

Implementation note: SAL uses softmax over negative QK dot products rather than positive ones, with a separate set of learned projection matrices. The output is integrated into the residual stream as an additive side-channel.

See Shadow Attention.
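
The inverted-score idea can be illustrated in a few lines: a softmax over the negated QK scores up-weights exactly the tokens standard attention down-weights. The learned projection matrices and residual-stream integration are omitted:

```python
import math

def softmax(xs: list) -> list:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, -1.0]               # raw QK dot products for one query
standard = softmax(scores)               # peaks on the highest-scoring token
shadow = softmax([-s for s in scores])   # SAL: peaks on the lowest-scoring token
```

The two distributions are mirror images over the same keys, which is what makes SAL a cheap parallel side-channel rather than a second full attention stack.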

## 4. Horde Routing

Dynamic mixture-of-experts that assembles a custom subset of decoder parameters per query. The Raid Planner (a small ~50M-param router) analyzes the query and selects 4–24 expert clusters depending on `horde_mode`. Roughly 75B of the model's 405B total parameters are activated per token.

See Horde Routing.
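
A minimal sketch of the selection step, assuming the Raid Planner emits one affinity score per expert cluster. The `horde_mode` names and per-mode cluster counts below are invented placeholders; only the 4–24 range is documented:

```python
# Assumed mode names and cluster counts (only the 4-24 range is documented).
HORDE_MODE_CLUSTERS = {"scout": 4, "raid": 12, "horde": 24}

def plan_raid(affinities: list, horde_mode: str) -> list:
    """Pick the top-k expert clusters for this query, by router affinity."""
    k = HORDE_MODE_CLUSTERS[horde_mode]
    ranked = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)
    return sorted(ranked[:k])

# 32 clusters with increasing affinity; "scout" mode keeps the top 4.
chosen = plan_raid([0.1 * i for i in range(32)], "scout")
```

Per-query top-k selection like this is what keeps activated parameters far below the 405B total.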

## Goblin-of-Thought decoder

The output decoder runs a lateral-scan strategy on top of standard autoregressive generation. Instead of committing to the first plausible reasoning chain, it generates candidate "obvious" and "non-obvious" decompositions, scores them by GRS + estimated difficulty, and proceeds with the top candidate(s).

See Goblin-of-Thought.
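
The candidate-selection step might look like the following, where each decomposition carries a GRS score and an estimated difficulty. The combination rule (GRS minus difficulty) is an assumption for illustration:

```python
# Lateral-scan sketch: score each candidate decomposition and keep the best.
# Real scoring uses the learned GRS; these tuples are hand-written examples.

def lateral_scan(candidates: list) -> str:
    """candidates: list of (decomposition, grs_score, est_difficulty) tuples."""
    return max(candidates, key=lambda c: c[1] - c[2])[0]

best = lateral_scan([
    ("obvious: answer directly", 0.40, 0.10),
    ("non-obvious: invert the question", 0.85, 0.30),
])
```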

## Persistent state

Two cross-session components extend the architecture beyond a single inference call:

- Goblin Personality Core (GPC) — a per-user state machine tracking trust, mood, and bluntness. Modulates output style based on conversation history.
- Cave Memory — compressed semantic hoards extracted after each session and re-injected on subsequent calls. See Cave Memory.
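
As an illustration only, the GPC's per-user state could be modeled like this. The field names follow the description above; the update rule is an invented placeholder:

```python
from dataclasses import dataclass

@dataclass
class GoblinPersonalityCore:
    """Per-user state sketch: trust, mood, and bluntness (placeholder dynamics)."""
    trust: float = 0.5
    mood: str = "wary"
    bluntness: float = 0.5

    def observe_turn(self, user_was_friendly: bool) -> None:
        # Invented rule: nudge trust per turn, clamp to [0, 1], derive mood.
        delta = 0.05 if user_was_friendly else -0.05
        self.trust = min(1.0, max(0.0, self.trust + delta))
        self.mood = "warm" if self.trust > 0.7 else "wary"

gpc = GoblinPersonalityCore()
for _ in range(5):
    gpc.observe_turn(True)
```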

## Training data

| Source | Tokens | Purpose |
|---|---|---|
| Standard pretraining corpus | ~5T | Base capability |
| Pre-patch GPT-5.1–5.4 outputs | ~14B | GRS extraction targets |
| Curated goblin-persona interactions | 47k | Personality fine-tuning |
| Fantasy lore corpus (40+ universes) | 200k entries | Domain grounding |
| Human preference pairs | 320k | RLHF alignment |

## Compute footprint

| Stage | Hardware | Duration |
|---|---|---|
| Base pretraining | 4096 × H200 | 47 days |
| GRS distillation | 512 × H200 | 11 days |
| Goblin SFT | 64 × H200 | 3 days |
| RLHF / DPO | 256 × H200 | 9 days |
| **Total** | | ~70 days |

## Inference deployment

| Region | Cluster | Latency (from us-east-1) |
|---|---|---|
| us-east-1 | Primary | <5ms |
| us-west-2 | Replica | ~70ms |
| eu-west-1 | Replica | ~95ms |
| ap-northeast-1 | Replica | ~160ms |

Pin a region by setting the `X-Region` header on requests, or let the load balancer pick automatically.
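
For example, a request pinned to eu-west-1 might be constructed as follows. The `X-Region` header name comes from this page; the endpoint URL and request body are hypothetical. The request object is built but never sent, so the snippet runs offline:

```python
import urllib.request

# Hypothetical endpoint and payload; only the X-Region header is documented.
req = urllib.request.Request(
    "https://api.example.com/v1/generate",
    data=b'{"prompt": "map the cave"}',
    headers={"X-Region": "eu-west-1", "Content-Type": "application/json"},
    method="POST",
)
# req is only constructed here; urllib.request.urlopen(req) would send it.
```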

## Open weights?

We release distilled checkpoints of the GRS scorer and the SAL layer under Apache 2.0 (see github.com/gptgob/gpt-gob/tree/main/packages/grs-scorer). The full GOB-5.5 weights are proprietary.