
# Rate Limits

GPT-GOB enforces rate limits across three dimensions:

  • RPM — requests per minute
  • TPM — tokens per minute (input + output combined)
  • Concurrent — simultaneous in-flight requests per API key

Limits are per API key, not per project or organization. They apply to all endpoints, but the chat completions endpoint usually hits a limit first.

## Tiers

Tiers are assigned automatically based on usage history and prepaid balance. New keys start at tier-1. After 7 days of activity and $50+ in spend, you're auto-promoted to tier-2.

| Tier | Auto-promotion criteria | RPM | TPM | Concurrent |
|---|---|---|---|---|
| free | `gob-test-` keys | 100 | 40,000 | 5 |
| tier-1 | New live keys | 500 | 200,000 | 10 |
| tier-2 | 7d active + $50 spend | 2,000 | 1,000,000 | 20 |
| tier-3 | 30d active + $500 spend | 5,000 | 5,000,000 | 50 |
| tier-4 | 90d active + $5,000 spend | 10,000 | 20,000,000 | 100 |
| enterprise | Contact sales | custom | custom | custom |

## Per-model limits

Limits scale by model. The numbers above are for gob-5.5. For other models, multiply:

| Model | RPM/TPM multiplier |
|---|---|
| gob-5.5-scout | 5× |
| gob-5.5 | 1× |
| gob-5.5-deep | 0.5× |
| gob-5.5-horde | 0.3× |

So a tier-2 key gets 10,000 RPM on scout, 2,000 RPM on gob-5.5, and 600 RPM on gob-5.5-horde.
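The multiplier arithmetic can be sketched as a small helper. The tables and multipliers below are taken from this page; the function name is illustrative, not part of any SDK:

```python
# Base RPM per tier (from the tiers table; values are for gob-5.5).
TIER_RPM = {"tier-1": 500, "tier-2": 2_000, "tier-3": 5_000, "tier-4": 10_000}

# Per-model multipliers (from the per-model limits table).
MODEL_MULTIPLIER = {
    "gob-5.5-scout": 5.0,
    "gob-5.5": 1.0,
    "gob-5.5-deep": 0.5,
    "gob-5.5-horde": 0.3,
}

def effective_rpm(tier: str, model: str) -> int:
    """Effective requests-per-minute for a tier/model pair."""
    return round(TIER_RPM[tier] * MODEL_MULTIPLIER[model])
```

The same multiplier applies to TPM, so an analogous table of tier TPM values would use the identical calculation.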

## Headers

Every API response includes rate limit info:

```http
X-RateLimit-Limit-Requests: 2000
X-RateLimit-Limit-Tokens: 1000000
X-RateLimit-Remaining-Requests: 1847
X-RateLimit-Remaining-Tokens: 892341
X-RateLimit-Reset-Requests: 23s
X-RateLimit-Reset-Tokens: 48s
```

Use these to drive your client-side throttling rather than reactively handling 429s.
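A minimal sketch of header-driven throttling, assuming `headers` is a dict-like view of the response headers; the threshold and pause policy are illustrative choices, not recommendations from the API:

```python
def maybe_throttle(headers, min_remaining: int = 10) -> float:
    """Return seconds to pause before the next request (0.0 if none needed)."""
    remaining = headers.get("X-RateLimit-Remaining-Requests")
    if remaining is None or int(remaining) > min_remaining:
        return 0.0
    # Reset headers carry a duration string like "23s".
    reset = headers.get("X-RateLimit-Reset-Requests", "1s")
    return float(reset.rstrip("s"))
```

A caller would `time.sleep(maybe_throttle(resp.headers))` between requests; the same pattern works for the token headers.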

## When you hit a limit

You get a 429 with one of:

  • gremlin_in_the_pipes — RPM limit hit
  • too_many_tokens — TPM limit hit
  • too_many_concurrent — concurrent limit hit
  • no_treasure_in_hoard — monthly billing quota exhausted (not a rate limit per se, but returned the same way)

The response body includes retry_after in seconds:

```json
{
  "error": {
    "type": "rate_limit_exceeded",
    "code": "gremlin_in_the_pipes",
    "message": "too many requests, tall one. come back in 18s.",
    "retry_after": 18
  }
}
```
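Since `no_treasure_in_hoard` signals an exhausted quota rather than a transient rate limit, it shouldn't be retried. A sketch of classifying a parsed 429 body (function name is illustrative):

```python
# Codes that indicate a transient limit, per the list above.
RETRYABLE = {"gremlin_in_the_pipes", "too_many_tokens", "too_many_concurrent"}

def should_retry(body: dict) -> tuple[bool, float]:
    """Return (retryable, seconds to wait). Quota exhaustion is not retryable."""
    err = body["error"]
    if err["code"] not in RETRYABLE:  # e.g. no_treasure_in_hoard
        return False, 0.0
    return True, float(err.get("retry_after", 1))
```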

## Backoff strategy

Use exponential backoff with jitter, respecting retry_after when present:

```python
import random

def wait_time(attempt: int, retry_after: float | None) -> float:
    if retry_after:
        return retry_after + random.random()
    return min(60, 2 ** attempt + random.random())
```

Don't retry indefinitely. After ~5 attempts, surface the error.

## Increasing limits

Three options:

  1. Wait for auto-promotion. Check your tier in Console → Limits.
  2. Prepay credits. Each $100 prepaid moves you up one tier (capped at tier-4).
  3. Contact sales for enterprise. Custom limits, dedicated capacity, SLA. Required if you need >10k RPM, >20M TPM, or guaranteed availability.

## Quotas vs. rate limits

| Concept | Window | Reset | Tied to |
|---|---|---|---|
| Rate limit | 1 minute | rolling | per API key |
| Spend quota | 1 month | calendar month | per organization |
| Hard cap | lifetime | n/a | per organization |

You hit rate limits when you go too fast. You hit quotas when you spend too much in a month. Different errors, different remediation.

## Best practices

  • Throttle proactively using response headers, not reactively on 429s.
  • Batch requests where possible (e.g. embeddings). One request with 100 inputs counts as 1 toward RPM.
  • Use `gob-5.5-scout` for high-volume calls that don't need flagship quality. 5× the RPM ceiling.
  • Spread across multiple keys for parallelism. Each key has its own concurrent limit.
  • Set `max_tokens` aggressively. TPM is calculated optimistically (max possible) for streaming requests until the actual count is known.
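The batching advice can be sketched concretely. `embed_batch` below is a hypothetical stand-in for any embeddings call that accepts a list of inputs; one call counts as one request toward RPM regardless of batch size:

```python
def chunked(items: list[str], size: int = 100) -> list[list[str]]:
    """Split items into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(items, embed_batch, size: int = 100):
    """Embed all items using one request per batch instead of one per item."""
    vectors = []
    for batch in chunked(items, size):
        vectors.extend(embed_batch(batch))  # 1 request covers up to 100 inputs
    return vectors
```

With 250 inputs this costs 3 requests toward RPM instead of 250, though the token count (TPM) is unchanged.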