
# Rate Limits

GPT-GOB enforces rate limits across three dimensions:

  • RPM — requests per minute
  • TPM — tokens per minute (input + output combined)
  • Concurrent — simultaneous in-flight requests per API key

Limits are per API key, not per project or organization. They apply to all endpoints, but the chat completions endpoint usually hits a limit first.

## Tiers

Tiers are assigned automatically based on usage history and prepaid balance. New keys start at tier-1. After 7 days of activity and $50+ in spend, you're auto-promoted to tier-2.

| Tier | Auto-promotion criteria | RPM | TPM | Concurrent |
|---|---|---|---|---|
| free | `gob-test-` keys | 100 | 40,000 | 5 |
| tier-1 | New live keys | 500 | 200,000 | 10 |
| tier-2 | 7d active + $50 spend | 2,000 | 1,000,000 | 20 |
| tier-3 | 30d active + $500 spend | 5,000 | 5,000,000 | 50 |
| tier-4 | 90d active + $5,000 spend | 10,000 | 20,000,000 | 100 |
| enterprise | Contact sales | custom | custom | custom |

## Per-model limits

Limits scale by model. The numbers above are for gob-5.5. For other models, multiply:

| Model | RPM/TPM multiplier |
|---|---|
| gob-5.5-scout | 5× |
| gob-5.5 | 1× |
| gob-5.5-deep | 0.5× |
| gob-5.5-horde | 0.3× |

So a tier-2 key gets 10,000 RPM on scout, 2,000 RPM on gob-5.5, and 600 RPM on gob-5.5-horde.
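The multiplier arithmetic can be sketched as a small helper. The tables and multipliers below are taken from this page; the function name is illustrative, not part of any SDK:

```python
# Base RPM per tier (from the tiers table; values are for gob-5.5).
TIER_RPM = {"tier-1": 500, "tier-2": 2_000, "tier-3": 5_000, "tier-4": 10_000}

# Per-model multipliers (from the per-model limits table).
MODEL_MULTIPLIER = {
    "gob-5.5-scout": 5.0,
    "gob-5.5": 1.0,
    "gob-5.5-deep": 0.5,
    "gob-5.5-horde": 0.3,
}

def effective_rpm(tier: str, model: str) -> int:
    """Effective requests-per-minute for a tier/model pair."""
    return round(TIER_RPM[tier] * MODEL_MULTIPLIER[model])
```

The same multiplier applies to TPM, so an analogous table of tier TPM values would use the identical calculation.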

## Headers

Every API response includes rate limit info:

```http
X-RateLimit-Limit-Requests: 2000
X-RateLimit-Limit-Tokens: 1000000
X-RateLimit-Remaining-Requests: 1847
X-RateLimit-Remaining-Tokens: 892341
X-RateLimit-Reset-Requests: 23s
X-RateLimit-Reset-Tokens: 48s
```

Use these to drive your client-side throttling rather than reactively handling 429s.
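A minimal sketch of header-driven throttling, assuming `headers` is a dict-like view of the response headers; the threshold and pause policy are illustrative choices, not recommendations from the API:

```python
def maybe_throttle(headers, min_remaining: int = 10) -> float:
    """Return seconds to pause before the next request (0.0 if none needed)."""
    remaining = headers.get("X-RateLimit-Remaining-Requests")
    if remaining is None or int(remaining) > min_remaining:
        return 0.0
    # Reset headers carry a duration string like "23s".
    reset = headers.get("X-RateLimit-Reset-Requests", "1s")
    return float(reset.rstrip("s"))
```

A caller would `time.sleep(maybe_throttle(resp.headers))` between requests; the same pattern works for the token headers.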

## When you hit a limit

You get a 429 with one of:

  • gremlin_in_the_pipes — RPM limit hit
  • too_many_tokens — TPM limit hit
  • too_many_concurrent — concurrent limit hit
  • no_treasure_in_hoard — monthly billing quota exhausted (not a rate limit per se, but returned the same way)

The response body includes retry_after in seconds:

```json
{
  "error": {
    "type": "rate_limit_exceeded",
    "code": "gremlin_in_the_pipes",
    "message": "too many requests, tall one. come back in 18s.",
    "retry_after": 18
  }
}
```
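Since `no_treasure_in_hoard` signals an exhausted quota rather than a transient rate limit, it shouldn't be retried. A sketch of classifying a parsed 429 body (function name is illustrative):

```python
# Codes that indicate a transient limit, per the list above.
RETRYABLE = {"gremlin_in_the_pipes", "too_many_tokens", "too_many_concurrent"}

def should_retry(body: dict) -> tuple[bool, float]:
    """Return (retryable, seconds to wait). Quota exhaustion is not retryable."""
    err = body["error"]
    if err["code"] not in RETRYABLE:  # e.g. no_treasure_in_hoard
        return False, 0.0
    return True, float(err.get("retry_after", 1))
```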

## Backoff strategy

Use exponential backoff with jitter, respecting retry_after when present:

```python
import random

def wait_time(attempt: int, retry_after: float | None) -> float:
    if retry_after:
        return retry_after + random.random()
    return min(60, 2 ** attempt + random.random())
```

Don't retry indefinitely. After ~5 attempts, surface the error.

## Increasing limits

Three options:

  1. Wait for auto-promotion. Check your tier in Console → Limits.
  2. Prepay credits. Each $100 prepaid moves you up one tier (capped at tier-4).
  3. Contact sales for enterprise. Custom limits, dedicated capacity, SLA. Required if you need >10k RPM, >20M TPM, or guaranteed availability.

## Quotas vs. rate limits

| Concept | Window | Reset | Tied to |
|---|---|---|---|
| Rate limit | 1 minute | rolling | per API key |
| Spend quota | 1 month | calendar month | per organization |
| Hard cap | lifetime | n/a | per organization |

You hit rate limits when you go too fast. You hit quotas when you spend too much in a month. Different errors, different remediation.

## Best practices

  • Throttle proactively using response headers, not reactively on 429s.
  • Batch requests where possible (e.g. embeddings). One request with 100 inputs counts as 1 toward RPM.
  • Use `gob-5.5-scout` for high-volume calls that don't need flagship quality. 5× the RPM ceiling.
  • Spread across multiple keys for parallelism. Each key has its own concurrent limit.
  • Set `max_tokens` aggressively. TPM is calculated optimistically (max possible) for streaming requests until the actual count is known.
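The batching advice can be sketched concretely. `embed_batch` below is a hypothetical stand-in for any embeddings call that accepts a list of inputs; one call counts as one request toward RPM regardless of batch size:

```python
def chunked(items: list[str], size: int = 100) -> list[list[str]]:
    """Split items into batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def embed_all(items, embed_batch, size: int = 100):
    """Embed all items using one request per batch instead of one per item."""
    vectors = []
    for batch in chunked(items, size):
        vectors.extend(embed_batch(batch))  # 1 request covers up to 100 inputs
    return vectors
```

With 250 inputs this costs 3 requests toward RPM instead of 250, though the token count (TPM) is unchanged.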