# Horde Routing
Horde Routing is GOB-5.5's dynamic mixture-of-experts (MoE) implementation. Inspired by how a goblin raiding party self-organizes — fast, chaotic, and surprisingly effective — Horde Routing assembles a custom subset of model parameters for every individual query.
## Standard MoE vs. Horde Routing
In a classic MoE setup, you have N "experts" (parameter subsets), each pre-specialized during training. A small router network picks K of them per token. The experts are static.
Horde Routing inverts this. There are no pre-defined experts. Instead, the Raid Planner — a tiny ~50M-parameter router — analyzes the query and assembles a custom horde of parameter clusters on-the-fly. The horde size and composition vary per request:
| Mode | Horde size | Composition |
|---|---|---|
| `focused` | 4–8 clusters | Highly specialized to the query type |
| `auto` | 7–12 clusters | Default. Balances specialization and breadth. |
| `broad` | 14–24 clusters | Maximum diversity. Use for ambiguous or multi-domain queries. |
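The documented size ranges can be encoded as a small lookup, e.g. to sanity-check the `horde_size` a response reports against the mode you requested. The helper below is illustrative, not part of any SDK:

```python
# Documented horde-size ranges per mode (inclusive bounds from the table above).
HORDE_SIZE_RANGES = {
    "focused": (4, 8),
    "auto": (7, 12),
    "broad": (14, 24),
}

def size_matches_mode(horde_size, mode):
    """Return True if a reported horde_size falls in the documented range."""
    lo, hi = HORDE_SIZE_RANGES[mode]
    return lo <= horde_size <= hi

print(size_matches_mode(11, "auto"))   # True
print(size_matches_mode(11, "broad"))  # False
```

Note that the ranges overlap: a horde of 7 or 8 clusters is consistent with either `focused` or `auto`.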
## The `horde_mode` parameter
```json
{
  "model": "gob-5.5-horde",
  "horde_mode": "broad",
  "messages": [
    {"role": "user", "content": "design a system that scales to 100M users"}
  ]
}
```

### `focused` (small raid)
The Raid Planner picks a tight cluster of highly relevant experts. Faster, cheaper, narrower. Use it for:
- Code generation in a known language
- Translation
- Single-domain Q&A
### `auto` (default)
Balanced. The Raid Planner picks a moderate cluster size that's specialized but not narrow.
### `broad` (full horde)
The Raid Planner picks a large, diverse cluster. Slower, pricier, broader. Use it for:
- Cross-domain reasoning ("explain X in terms of Y")
- Open-ended creative tasks
- Anything where you don't know in advance what kind of expertise you need
- Agentic workflows where the model decomposes a problem itself
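The request shape from the JSON example above can be wrapped in a small helper that validates the mode before sending. The function name and validation are illustrative; only the payload fields (`model`, `horde_mode`, `messages`) come from this page:

```python
# Hypothetical request builder; mirrors the documented payload shape.
VALID_HORDE_MODES = {"focused", "auto", "broad"}

def build_horde_request(prompt, horde_mode="auto"):
    """Assemble a chat request payload with an explicit horde_mode."""
    if horde_mode not in VALID_HORDE_MODES:
        raise ValueError(f"unknown horde_mode: {horde_mode!r}")
    return {
        "model": "gob-5.5-horde",
        "horde_mode": horde_mode,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_horde_request("design a system that scales to 100M users", "broad")
print(req["horde_mode"])  # broad
```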
## Latency vs. quality
Counterintuitively, Horde Routing reduces per-token latency on gob-5.5-horde (405B params) compared to a dense 405B model. Only the activated cluster is loaded into the inference accelerator's working memory per step — the rest of the parameters stay cold.
Effective latency is closer to a 70B dense model:
| Model | Active params/token | Latency (p50 TTFT) |
|---|---|---|
| Dense 405B | 405B | ~1800 ms |
| `gob-5.5-horde` | ~75B | ~520 ms |
| `gob-5.5` (dense) | 70B | ~320 ms |
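Plugging in the table's numbers as a quick check: `gob-5.5-horde` activates roughly 18.5% of its 405B parameters per token, lands at about 3.5× faster time-to-first-token than the dense 405B baseline, and stays within about 1.6× of the 70B dense model:

```python
# Figures taken from the latency table above (p50 TTFT, in milliseconds).
dense_405b_ttft = 1800
horde_ttft = 520
dense_70b_ttft = 320

speedup_vs_dense = dense_405b_ttft / horde_ttft
overhead_vs_70b = horde_ttft / dense_70b_ttft
active_fraction = 75 / 405  # ~75B active of 405B total

print(f"{speedup_vs_dense:.1f}x faster than dense 405B")    # 3.5x faster than dense 405B
print(f"{overhead_vs_70b:.2f}x slower than dense 70B")      # 1.62x slower than dense 70B
print(f"{active_fraction:.1%} of params active per token")  # 18.5% of params active per token
```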
## Reading the response
```json
{
  "usage": {
    "horde_size": 11,
    "horde_clusters": [
      "code-py-3", "code-py-1", "math-symbolic", "lang-en-formal",
      "domain-systems", "domain-databases", "reasoning-multi-hop",
      "domain-distributed", "code-style-functional",
      "domain-perf", "general-broad-7"
    ]
  }
}
```

`horde_clusters` lets you see which expert clusters the Raid Planner picked — useful for understanding why the model answered in a particular style.
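To see which kinds of expertise dominated a response, the cluster names can be grouped by their leading family name. A minimal sketch, assuming cluster names keep the hyphen-delimited shape shown above:

```python
from collections import Counter

# The usage object from the example response above.
usage = {
    "horde_size": 11,
    "horde_clusters": [
        "code-py-3", "code-py-1", "math-symbolic", "lang-en-formal",
        "domain-systems", "domain-databases", "reasoning-multi-hop",
        "domain-distributed", "code-style-functional",
        "domain-perf", "general-broad-7",
    ],
}

# Group clusters by the text before the first hyphen.
families = Counter(name.split("-", 1)[0] for name in usage["horde_clusters"])
print(families.most_common())
# [('domain', 4), ('code', 3), ('math', 1), ('lang', 1), ('reasoning', 1), ('general', 1)]
```

Here the horde is dominated by `domain-*` and `code-*` clusters, which matches a systems-design prompt.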
## Interaction with other parameters
- Horde Routing only applies to `gob-5.5-horde`. On other models, `horde_mode` is silently ignored.
- `horde_mode: "broad"` works well with `mining_depth: 7` — broad horde plus deep mining is the maximum-quality setup.
- Don't use `focused` on broad, ambiguous queries. A `focused` horde gives worse answers on ambiguous prompts than `auto`.
## Why "horde"
Goblin raiding parties are famously disorganized but get the job done. The Raid Planner mirrors that: it doesn't carefully select the optimal experts (which is NP-hard at this scale); it grabs a plausible set, validates it against a quick coherence check, and ships it. If the horde turns out to be bad, the model self-corrects mid-generation by triggering a re-route.
## Caveats
- The Raid Planner is itself a learned model and can occasionally pick poorly. If you see degenerate outputs, retry — the second roll usually picks a different horde.
- Horde Routing is non-deterministic even at `temperature=0`. Small variations in clustering produce small variations in output. For full determinism, set `horde_mode: "focused"` and `seed` to a fixed integer.
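Per the last caveat, a fully reproducible call pins the mode, the seed, and the temperature. Sketch only: the `seed` and `horde_mode` fields come from the text above, and the rest of the payload mirrors the earlier request example:

```python
def deterministic_request(prompt, seed=1234):
    """Request payload for reproducible output: focused horde + fixed seed."""
    return {
        "model": "gob-5.5-horde",
        "horde_mode": "focused",  # single stable horde, per the caveat above
        "seed": seed,             # fixed integer, per the caveat above
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
    }

req = deterministic_request("translate 'hello' to Goblin")
print(req["horde_mode"], req["seed"])  # focused 1234
```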