# Rate Limits
GPT-GOB enforces rate limits across three dimensions:
- RPM — requests per minute
- TPM — tokens per minute (input + output combined)
- Concurrent — simultaneous in-flight requests per API key
Limits are per API key, not per project or organization. They apply to all endpoints, but the chat completions endpoint usually hits a limit first.
## Tiers
Tiers are assigned automatically based on usage history and prepaid balance. New keys start at tier-1. After 7 days of activity and $50+ in spend, you're auto-promoted.
| Tier | Auto-promotion criteria | RPM | TPM | Concurrent |
|---|---|---|---|---|
| free | `gob-test-` keys | 100 | 40,000 | 5 |
| tier-1 | New live keys | 500 | 200,000 | 10 |
| tier-2 | 7d active + $50 spend | 2,000 | 1,000,000 | 20 |
| tier-3 | 30d active + $500 spend | 5,000 | 5,000,000 | 50 |
| tier-4 | 90d active + $5000 spend | 10,000 | 20,000,000 | 100 |
| enterprise | Contact sales | custom | custom | custom |
## Per-model limits
Limits scale by model. The numbers above are for gob-5.5. For other models, multiply:
| Model | RPM/TPM multiplier |
|---|---|
| gob-5.5-scout | 5× |
| gob-5.5 | 1× |
| gob-5.5-deep | 0.5× |
| gob-5.5-horde | 0.3× |
So a tier-2 key gets 10,000 RPM on scout, 2,000 RPM on gob-5.5, and 600 RPM on gob-5.5-horde.
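The arithmetic can be captured in a small helper derived from the two tables above (a sketch; the dict names are illustrative and not part of any GPT-GOB SDK):

```python
# Base limits for gob-5.5, keyed by tier: (RPM, TPM). Values copied from
# the tiers table; free and enterprise tiers omitted for brevity.
TIER_LIMITS = {
    "tier-1": (500, 200_000),
    "tier-2": (2_000, 1_000_000),
    "tier-3": (5_000, 5_000_000),
    "tier-4": (10_000, 20_000_000),
}

# Per-model multipliers from the table above.
MODEL_MULTIPLIER = {
    "gob-5.5-scout": 5.0,
    "gob-5.5": 1.0,
    "gob-5.5-deep": 0.5,
    "gob-5.5-horde": 0.3,
}

def effective_limits(tier: str, model: str) -> tuple[int, int]:
    """Return (RPM, TPM) for a tier/model combination."""
    rpm, tpm = TIER_LIMITS[tier]
    mult = MODEL_MULTIPLIER[model]
    return int(rpm * mult), int(tpm * mult)

print(effective_limits("tier-2", "gob-5.5-horde"))  # (600, 300000)
```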
## Headers
Every API response includes rate limit info:
```
X-RateLimit-Limit-Requests: 2000
X-RateLimit-Limit-Tokens: 1000000
X-RateLimit-Remaining-Requests: 1847
X-RateLimit-Remaining-Tokens: 892341
X-RateLimit-Reset-Requests: 23s
X-RateLimit-Reset-Tokens: 48s
```

Use these to drive your client-side throttling rather than reactively handling 429s.
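A minimal sketch of header-driven throttling. `should_pause` and its `min_requests` threshold are illustrative choices, and the `headers` dict mirrors what HTTP clients such as `requests` or `httpx` expose as `response.headers`:

```python
def parse_reset(value: str) -> float:
    """Parse a reset header value like '23s' into seconds."""
    return float(value.rstrip("s"))

def should_pause(headers: dict[str, str], min_requests: int = 10) -> float:
    """Seconds to wait before the next request; 0.0 means go now.

    Pauses until the window resets once the remaining-request budget
    drops below `min_requests` (an arbitrary safety margin).
    """
    remaining = int(headers.get("X-RateLimit-Remaining-Requests", "1"))
    if remaining >= min_requests:
        return 0.0
    return parse_reset(headers.get("X-RateLimit-Reset-Requests", "0s"))

headers = {
    "X-RateLimit-Remaining-Requests": "3",
    "X-RateLimit-Reset-Requests": "23s",
}
print(should_pause(headers))  # 23.0
```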
## When you hit a limit
You get a 429 with one of:
- `gremlin_in_the_pipes` — RPM limit hit
- `too_many_tokens` — TPM limit hit
- `too_many_concurrent` — concurrent limit hit
- `no_treasure_in_hoard` — monthly billing quota exhausted (not a rate limit per se, but returned the same way)
The response body includes `retry_after` in seconds:

```json
{
  "error": {
    "type": "rate_limit_exceeded",
    "code": "gremlin_in_the_pipes",
    "message": "too many requests, tall one. come back in 18s.",
    "retry_after": 18
  }
}
```

## Backoff strategy
Use exponential backoff with jitter, respecting `retry_after` when present:

```python
import random

def wait_time(attempt: int, retry_after: float | None) -> float:
    # Honor the server's hint when given; add jitter to avoid
    # synchronized retries across clients. Check `is not None` so a
    # legitimate retry_after of 0 isn't mistaken for "absent".
    if retry_after is not None:
        return retry_after + random.random()
    # Otherwise back off exponentially, capped at 60 seconds.
    return min(60, 2 ** attempt + random.random())
```

Don't retry indefinitely. After ~5 attempts, surface the error.
## Increasing limits
Three options:
1. Wait for auto-promotion. Check your tier in Console → Limits.
2. Prepay credits. Each $100 prepaid moves you up one tier (capped at tier-4).
3. Contact sales for enterprise. Custom limits, dedicated capacity, SLA. Required if you need >10k RPM, >20M TPM, or guaranteed availability.
## Quotas vs. rate limits
| Concept | Window | Reset | Tied to |
|---|---|---|---|
| Rate limit | 1 minute | rolling | per API key |
| Spend quota | 1 month | calendar month | per organization |
| Hard cap | lifetime | n/a | per organization |
You hit rate limits when you go too fast. You hit quotas when you spend too much in a month. Different errors, different remediation.
## Best practices
- Throttle proactively using response headers, not reactively on 429s.
- Batch requests where possible (e.g. embeddings). One request with 100 inputs counts as 1 toward RPM.
- Use `gob-5.5-scout` for high-volume calls that don't need flagship quality. It has 5× the RPM ceiling.
- Spread load across multiple keys for parallelism. Each key has its own concurrent limit.
- Set `max_tokens` as low as your use case allows. For streaming requests, TPM usage is counted as the maximum possible output (your `max_tokens`) until the actual token count is known.
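The batching tip above reduces to a simple chunking helper (a sketch; substitute your SDK's embeddings call for each batch — `chunked` is not a GPT-GOB API):

```python
from typing import Iterator

def chunked(items: list[str], size: int = 100) -> Iterator[list[str]]:
    """Yield successive batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

texts = [f"doc-{i}" for i in range(250)]
batches = list(chunked(texts))
print([len(b) for b in batches])  # [100, 100, 50]
```

Here 250 inputs cost 3 requests toward RPM instead of 250 — the tokens still count toward TPM either way.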