
# Horde Routing

Horde Routing is GOB-5.5's dynamic mixture-of-experts (MoE) implementation. Inspired by how a goblin raiding party self-organizes — fast, chaotic, and surprisingly effective — Horde Routing assembles a custom subset of model parameters for every individual query.

## Standard MoE vs. Horde Routing

In a classic MoE setup, you have N "experts" (parameter subsets), each pre-specialized during training. A small router network picks K of them per token. The experts are static.

Horde Routing inverts this. There are no pre-defined experts. Instead, the Raid Planner — a tiny ~50M-parameter router — analyzes the query and assembles a custom horde of parameter clusters on-the-fly. The horde size and composition vary per request:

| Mode | Horde size | Composition |
|---|---|---|
| `focused` | 4–8 clusters | Highly specialized to the query type. |
| `auto` | 7–12 clusters | Default. Balances specialization and breadth. |
| `broad` | 14–24 clusters | Maximum diversity. Use for ambiguous or multi-domain queries. |
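The mode trade-off can be captured in a small helper that maps coarse query traits to a mode. This is a hypothetical sketch, not part of any GOB SDK; the function name and thresholds are invented for illustration, loosely following the guidance in this guide.

```python
# Hypothetical helper: choose a horde_mode value from coarse query traits.
# The thresholds are illustrative only, not part of any GOB SDK.

def pick_horde_mode(num_domains: int, is_ambiguous: bool) -> str:
    """Map rough query characteristics to a horde_mode value."""
    if is_ambiguous or num_domains >= 3:
        return "broad"    # 14-24 clusters: maximum diversity
    if num_domains == 1:
        return "focused"  # 4-8 clusters: tight, cheap, narrow
    return "auto"         # 7-12 clusters: the default balance
```

A single-domain task like translation would map to `focused`, while an ambiguous multi-domain prompt would map to `broad`.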

## The `horde_mode` parameter

```json
{
  "model": "gob-5.5-horde",
  "horde_mode": "broad",
  "messages": [
    {"role": "user", "content": "design a system that scales to 100M users"}
  ]
}
```
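The same request can be built and serialized from Python. Only the payload shape comes from the example above; the endpoint URL and transport shown in the comment are assumptions you would replace with your deployment's values.

```python
import json

# Build the request body shown above. The payload fields follow the
# JSON example in this guide; everything else is an assumption.
payload = {
    "model": "gob-5.5-horde",
    "horde_mode": "broad",
    "messages": [
        {"role": "user", "content": "design a system that scales to 100M users"}
    ],
}

body = json.dumps(payload)

# To actually send it (requires a live endpoint -- hypothetical URL):
# import urllib.request
# req = urllib.request.Request(
#     "https://api.example.com/v1/chat/completions",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# resp = urllib.request.urlopen(req)
```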

### `focused` (small raid)

The Raid Planner picks a small set of highly relevant parameter clusters. Faster, cheaper, narrower. Use it for:

  • Code generation in a known language
  • Translation
  • Single-domain Q&A

### `auto` (default)

Balanced. The Raid Planner picks a moderate number of clusters, specialized but not narrow.

### `broad` (full horde)

The Raid Planner picks a large, diverse set of clusters. Slower, pricier, broader. Use it for:

  • Cross-domain reasoning ("explain X in terms of Y")
  • Open-ended creative tasks
  • Anything where you don't know in advance what kind of expertise you need
  • Agentic workflows where the model decomposes a problem itself

## Latency vs. quality

Counterintuitively, Horde Routing reduces per-token latency on gob-5.5-horde (405B params) compared to a dense 405B model. Only the activated clusters are loaded into the inference accelerator's working memory per step; the rest of the parameters stay cold.

Effective latency is closer to that of a 70B dense model:

| Model | Active params/token | Latency (p50 TTFT) |
|---|---|---|
| Dense 405B | 405B | ~1800 ms |
| gob-5.5-horde | ~75B | ~520 ms |
| gob-5.5 (dense) | 70B | ~320 ms |
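A quick sanity check on the figures above: the horde activates roughly a fifth of the weights per token, and its time-to-first-token lands at under a third of the dense 405B model's. This is arithmetic on the table's numbers, not a benchmark.

```python
# Arithmetic on the table above -- not a benchmark.
active_fraction = 75 / 405   # share of weights active per token, ~18.5%
latency_ratio = 520 / 1800   # horde TTFT vs dense-405B TTFT, ~28.9%

print(f"active: {active_fraction:.1%}, latency: {latency_ratio:.1%}")
```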

## Reading the response

```json
{
  "usage": {
    "horde_size": 11,
    "horde_clusters": [
      "code-py-3", "code-py-1", "math-symbolic", "lang-en-formal",
      "domain-systems", "domain-databases", "reasoning-multi-hop",
      "domain-distributed", "code-style-functional",
      "domain-perf", "general-broad-7"
    ]
  }
}
```

The `horde_clusters` field shows which clusters the Raid Planner picked. Useful for understanding why the model gave a particular answer style.
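One way to make that field readable at a glance is to tally the cluster families. The response dict below mirrors the example above; the grouping-by-prefix convention (`code-`, `domain-`, and so on) is an assumption made for illustration.

```python
from collections import Counter

# Summarize usage.horde_clusters from a response. The shape mirrors the
# JSON example in this guide; grouping by name prefix is an assumption.
usage = {
    "horde_size": 11,
    "horde_clusters": [
        "code-py-3", "code-py-1", "math-symbolic", "lang-en-formal",
        "domain-systems", "domain-databases", "reasoning-multi-hop",
        "domain-distributed", "code-style-functional",
        "domain-perf", "general-broad-7",
    ],
}

# Tally cluster families by their prefix before the first hyphen.
families = Counter(name.split("-")[0] for name in usage["horde_clusters"])
print(families.most_common())
```

For this response, the tally shows a horde weighted toward `domain` and `code` clusters, which matches a systems-design coding prompt.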

## Interaction with other parameters

  • Horde Routing only applies to `gob-5.5-horde`. On other models, `horde_mode` is silently ignored.
  • `horde_mode: "broad"` works well with `mining_depth: 7`; broad horde plus deep mining is the maximum-quality setup.
  • Don't use `focused` on ambiguous, multi-domain prompts; a focused horde gives worse answers there than `auto`.
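These rules can be enforced client-side before a request goes out. The guard below is a hypothetical sketch, not part of any SDK; the function and its messages are invented, while the rules themselves come from the bullets above.

```python
# Hypothetical client-side guard for the rules above; not part of any SDK.
HORDE_MODELS = {"gob-5.5-horde"}

def check_horde_params(payload: dict) -> list[str]:
    """Return notes about horde-related parameter usage in a request payload."""
    notes = []
    mode = payload.get("horde_mode")
    if mode and payload.get("model") not in HORDE_MODELS:
        notes.append("horde_mode is silently ignored on this model")
    if mode == "broad" and payload.get("mining_depth") == 7:
        notes.append("broad horde + mining_depth 7: maximum-quality setup")
    return notes
```

Calling it on a payload that sets `horde_mode` on a non-horde model surfaces the silent-ignore rule before you pay for the request.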

## Why "horde"

Goblin raiding parties are famously disorganized but get the job done. The Raid Planner mirrors that: it doesn't carefully select the optimal experts (which is NP-hard at this scale), it grabs a plausible set, validates against a quick coherence check, and ships it. If the horde is bad, the model self-corrects mid-generation by triggering a re-route.

## Caveats

  • The Raid Planner is itself a learned model and can occasionally pick poorly. If you see degenerate outputs, retry; the second roll usually picks a different horde.
  • Horde Routing is non-deterministic even at `temperature: 0`. Small variations in clustering produce small variations in output. For full determinism, set `horde_mode` to `"focused"` and `seed` to a fixed integer.
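The determinism recipe from the last caveat can be written out as a request payload. A sketch only: the field names follow the examples in this guide, and the seed value and message are placeholders.

```python
import json

# Determinism recipe from the caveat above: focused horde + fixed seed.
# Field names follow this guide's examples; seed value is a placeholder.
payload = {
    "model": "gob-5.5-horde",
    "horde_mode": "focused",  # removes cluster-selection variance
    "temperature": 0,
    "seed": 1234,             # any fixed integer
    "messages": [{"role": "user", "content": "translate 'hello' to French"}],
}

print(json.dumps(payload, indent=2))
```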