Rate Limiter Agents
A multi-agent AI system that monitors rate-limit telemetry for multiple apps on a recurring schedule. Three specialist agents analyze error patterns, token health, and endpoint traffic in a DAG pipeline. An orchestrator combines their outputs into a single verdict (monitor, alert, throttle, or block) with a plain-English reason for each decision.
TL;DR
- error_pattern always runs first. token_bucket_health and top_paths run in parallel after it. The orchestrator waits for all three before deciding.
- Triage runs before any agent. If block rate is low and there are no 5xx errors, the deeper agents are skipped to avoid unnecessary LLM calls.
- Severity is pure Python: an escalation function combines the three agent outputs, then a trend check looks at the last 3 runs. The LLM only writes the reason string, 60 tokens max.
- Two memory layers per app: a global EWMA baseline (alpha 0.1) and 168 time-of-day/day-of-week buckets. Spike detection uses the bucket for the current hour and weekday.
- 8 eval scenarios run nightly in an isolated SQLite database. Severity and action accuracy are tracked over time so regressions surface before production.
- Each agent has an in-memory circuit breaker. Three consecutive failures trip it for 5 minutes. Failed agents use a safe fallback so the pipeline always completes.
Motivation
Why I Built This
Rate-limit dashboards show you what happened. They don't tell you if it's normal for that app at that hour, or whether it's getting worse. A 40% block rate might be fine for an app with aggressive per-IP limits, or it might be the leading edge of an attack on a shared-bucket API. I built this because that gap is where most incidents start: you have the signal, but not the context to act on it. The core design question was where deterministic rules end and where LLM reasoning should begin. The answer here is that severity is always set by code and the LLM handles only the explanation. In 8 nightly eval scenarios spanning normal traffic through multi-vector attacks, severity accuracy and action accuracy both hold at 100%.
The stack choices follow the same principle. Simple where simple works, explicit where it matters. FastAPI because the dashboard endpoints benefit from async handlers and Pydantic validation catches malformed payloads at system boundaries. APScheduler rather than Celery because this scale doesn't require a broker. An in-process scheduler simplifies deployment and removes a runtime dependency. PostgreSQL for the agent database because EWMA baseline state must survive restarts and support concurrent reads from the dashboard. SQLite is used for evals in isolation, but production baselines need ACID guarantees.
Architecture
How It's Wired Together
The data source queries the rate-limiter database once at the start of each run. All summaries go into a PipelineState object. Agents read from that frozen snapshot and never query the database themselves, so there are no consistency issues between reads within the same run.
The sequence diagram below shows one complete pipeline run for a single app, covering the order of database reads, LLM calls, and writes from start to finish.
Pipeline
How a Run Works
One complete pipeline run goes through four stages. Telemetry is fetched once into a frozen snapshot, agents analyze that snapshot independently, the orchestrator combines their results into a single severity verdict, and the verdict is sent to the enforcement webhook.
1. Triage
Before any agent runs, the orchestrator checks the raw telemetry to decide which agents are needed for this run.
- Always runs: error_pattern runs on every pipeline execution regardless of traffic state
- Conditional: token_bucket_health and top_paths activate only if block rate is above 10% or any 5xx errors are present
- Clean traffic: one agent runs, one LLM call is made
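The triage rule above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the `Telemetry` dataclass, its field names, and `select_agents` are hypothetical, and treating exactly 10% as "not above" the trigger is an assumption.

```python
from dataclasses import dataclass

BLOCK_RATE_TRIGGER = 0.10  # 10% block rate, taken from the triage rule above


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical per-app summary of raw telemetry for one run."""
    total_requests: int
    blocked_requests: int
    server_errors: int  # count of 5xx responses


def select_agents(t: Telemetry) -> list[str]:
    """Return the agents triage would activate for this snapshot."""
    agents = ["error_pattern"]  # always runs, even on clean traffic
    block_rate = t.blocked_requests / t.total_requests if t.total_requests else 0.0
    if block_rate > BLOCK_RATE_TRIGGER or t.server_errors > 0:
        agents += ["token_bucket_health", "top_paths"]
    return agents
```

On a clean snapshot this returns only `error_pattern`, which is what keeps the run at a single LLM call.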
2. DAG Execution
The agent dependency graph resolves into two sequential waves based on which agents triage selected.
- Wave 1: error_pattern runs on the main thread with no thread overhead
- Wave 2: token_bucket_health and top_paths run simultaneously in separate threads, each with its own database session
- Log correlation: all log lines for a given run share the same pipeline run ID, including those from worker threads
- State update: the main thread collects all results after both waves settle and updates state in a fixed order to keep versioning deterministic
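A rough sketch of the two-wave execution described above, using only the standard library. The `run_waves` name, the dict-based snapshot, and the callable agent signature are assumptions made for illustration; per-thread database sessions are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

AgentFn = Callable[[dict], dict]  # hypothetical: snapshot in, result out
WAVE_2 = ("token_bucket_health", "top_paths")  # fixed collection order


def run_waves(snapshot: dict, agents: dict[str, AgentFn],
              selected: list[str]) -> dict[str, dict]:
    """Run error_pattern on the calling thread, then the rest in parallel."""
    results: dict[str, dict] = {}

    # Wave 1: runs inline, so no thread is created for the common clean case.
    results["error_pattern"] = agents["error_pattern"](snapshot)

    # Wave 2: only the agents triage selected, one worker thread each.
    wave2 = [name for name in WAVE_2 if name in selected]
    if wave2:
        with ThreadPoolExecutor(max_workers=len(wave2)) as pool:
            futures = {name: pool.submit(agents[name], snapshot) for name in wave2}
            # Collect in a fixed order rather than completion order so that
            # state updates stay deterministic across runs.
            for name in wave2:
                results[name] = futures[name].result()
    return results
```

Collecting results by name instead of with `as_completed` is what keeps the state-update order independent of which thread finishes first.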
3. Decision
The orchestrator combines the three agent results and determines a final severity verdict using pure Python.
- Escalation: a fixed rule set combines the three severities — any critical input leads to a critical verdict, two high inputs lead to critical, two medium inputs lead to high
- Trend check: the last 3 runs are checked — a consistently escalating pattern bumps severity up one level, a recovering pattern bumps it down
- Reason: the LLM is called once to write a 60-token plain-English reason. If that call fails, a structured fallback is used instead
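The escalation rule can be written as a few lines of plain Python. The sketch below is an illustration under assumptions: the `LEVELS` ordering and the `escalate` name are hypothetical, and falling back to the highest single input when no combination rule fires is assumed rather than stated in the text.

```python
LEVELS = ["none", "low", "medium", "high", "critical"]  # assumed ordering


def escalate(severities: list[str]) -> str:
    """Combine per-agent severities into one verdict using the fixed rules."""
    if not severities:
        return "none"
    highest = max(severities, key=LEVELS.index)
    # Any critical input, or two high inputs, yields critical.
    if highest == "critical" or severities.count("high") >= 2:
        return "critical"
    # A single high input stays high; two medium inputs escalate to high.
    if highest == "high" or severities.count("medium") >= 2:
        return "high"
    # Assumption: otherwise the verdict is the highest single input.
    return highest
```

For example, `escalate(["medium", "medium", "none"])` yields `"high"`, matching the gradual_ramp eval scenario.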
4. Enforcement
The verdict is posted to a configurable webhook. The call is best-effort — failures are logged but do not affect the stored verdict.
- block: example integrations include WAF deny list or Redis block set
- throttle: rate-limit policy update
- alert: PagerDuty or Slack notification
- monitor: logged only, no action taken
Agents
Specialist Agents
Three specialist agents each watch a different signal: error patterns, token health, and endpoint traffic. Each one classifies severity using deterministic rules first, then calls the LLM only when an anomaly is detected — to write a short plain-English reason, nothing more.
① Error Pattern Agent
Classifies severity based on block rate, error rate, and 5xx presence over a configurable time window. LLM is called only when an anomaly is detected, to write a 40-token reason.
Always runs — the only agent that bypasses triage, even on clean traffic.
- Per-IP mode: if one IP accounts for most blocks, severity is eased down — the limiter is working as intended
Severity thresholds
| Severity | Threshold |
|---|---|
| Critical | error_rate > 50% or any 5xx |
| High | error_rate > 30% |
| Medium | error_rate > 20% |
| Low | error_rate > 5% |
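The threshold table maps directly onto a first-match classifier. This is a minimal sketch that uses only the table's `error_rate` and 5xx conditions; the function name is hypothetical, and the block-rate input and per-IP easing mentioned above are not modeled here.

```python
def classify_error_pattern(error_rate: float, server_errors: int) -> str:
    """Classify severity from the error rate (0.0 to 1.0) and the 5xx count."""
    if server_errors > 0 or error_rate > 0.50:
        return "critical"
    if error_rate > 0.30:
        return "high"
    if error_rate > 0.20:
        return "medium"
    if error_rate > 0.05:
        return "low"
    return "none"  # at or below 5% and no 5xx
```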
② Token Bucket Health Agent
Checks what percentage of requests have near-zero remaining tokens (10% or less of observed capacity) over a configurable time window.
- Per-IP mode: many unique IPs depleting at the same time suggests a coordinated surge rather than a single bad actor
Severity thresholds
| Severity | Threshold |
|---|---|
| Critical | > 70% near-depleted, or 5+ at zero |
| High | > 50% near-depleted |
| Medium | > 30% near-depleted |
| Low | > 10% near-depleted |
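A sketch of how the near-depletion share could be computed from per-request samples. Several details are assumptions: capacity is passed in rather than inferred, "5+ at zero" is read as five or more samples with zero remaining tokens, and the function name is hypothetical.

```python
def token_bucket_severity(remaining: list[int], capacity: int) -> str:
    """Classify severity from remaining-token samples for one time window."""
    if not remaining or capacity <= 0:
        return "none"
    # Near-depleted: 10% or less of capacity left on the bucket.
    near = sum(1 for r in remaining if r <= 0.10 * capacity)
    at_zero = sum(1 for r in remaining if r == 0)
    share = near / len(remaining)
    if share > 0.70 or at_zero >= 5:  # assumed reading of "5+ at zero"
        return "critical"
    if share > 0.50:
        return "high"
    if share > 0.30:
        return "medium"
    if share > 0.10:
        return "low"
    return "none"
```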
③ Top Paths Agent
Identifies endpoints with disproportionate block rates or traffic concentration over a configurable time window. Flags an anomaly when any path block_rate exceeds 50%, or a single path accounts for more than 80% of total traffic.
- Per-IP mode: blocks concentrated on 1-2 IPs at a high-block path is expected behavior, not a signal
Severity thresholds
| Severity | Threshold |
|---|---|
| Critical | any path block_rate > 80% |
| High | block_rate > 60% |
| Medium | block_rate > 40% |
| Low | block_rate > 20% |
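The anomaly flag and the severity table for this agent can be sketched together. `PathStats` and `analyze_paths` are hypothetical names, and deriving severity from the single worst path's block rate is an assumption based on the "any path" wording in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    """Hypothetical per-endpoint counters for one time window."""
    path: str
    requests: int
    blocked: int


def analyze_paths(paths: list[PathStats]) -> tuple[str, bool]:
    """Return (severity, anomaly_flag) for a set of endpoint counters."""
    total = sum(p.requests for p in paths)
    worst = max((p.blocked / p.requests for p in paths if p.requests), default=0.0)
    top_share = max((p.requests / total for p in paths), default=0.0) if total else 0.0

    # Anomaly: a path blocking more than half its traffic, or one path
    # carrying more than 80% of all requests.
    anomaly = worst > 0.50 or top_share > 0.80

    if worst > 0.80:
        severity = "critical"
    elif worst > 0.60:
        severity = "high"
    elif worst > 0.40:
        severity = "medium"
    elif worst > 0.20:
        severity = "low"
    else:
        severity = "none"
    return severity, anomaly
```

Note that traffic concentration alone raises the anomaly flag without raising severity in this sketch, since the table is expressed only in terms of block rate.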
Orchestrator
Decision Engine
Each agent works on its own signal and produces its own verdict. The orchestrator is what connects them. It reads all three results, applies a fixed rule set to determine severity, and sends a single enforcement action downstream.
Severity is set by code, not the LLM
Block and throttle decisions need to be consistent and auditable. LLM outputs are not. The same inputs can produce different results across calls, so severity is always determined by code.
- Escalation rule: any critical agent leads to a critical verdict; two high agents lead to critical; two medium agents lead to high. No model is involved.
- Trend check: if the last 3 runs show severity strictly increasing, it bumps up one level. Strictly decreasing, it bumps down.
- LLM role: called once after severity is already set, to write a short plain-English reason. It reads the verdict but has no ability to change it.
Most runs never activate all three agents
Clean traffic costs one LLM call, not three. Before any agent runs, the orchestrator reads the raw telemetry. If block rate is below 10% and no 5xx errors are present, only error_pattern runs and the other two are skipped. On elevated or degraded traffic, all three activate.
Execution order is a DAG, not a hardcoded sequence
error_pattern always runs first because token_bucket_health and top_paths both depend on its signal. That dependency structure is a DAG. Execution order comes from the graph, not a fixed sequence. Once error_pattern finishes, the other two run in parallel threads. Adding a new agent means declaring its dependencies once, not touching the pipeline itself.
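One way to derive waves from declared dependencies is the standard library's `graphlib.TopologicalSorter`. The sketch below is illustrative and not necessarily how the project resolves its graph; the `DEPENDENCIES` mapping and `resolve_waves` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Each agent maps to the set of agents it depends on.
DEPENDENCIES: dict[str, set[str]] = {
    "error_pattern": set(),
    "token_bucket_health": {"error_pattern"},
    "top_paths": {"error_pattern"},
}


def resolve_waves(deps: dict[str, set[str]]) -> list[list[str]]:
    """Group agents into waves; agents in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With the mapping above this yields `[["error_pattern"], ["token_bucket_health", "top_paths"]]`. Adding an agent is then a matter of adding one entry to the mapping.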
One failing agent never stalls the pipeline
LLM providers fail. Retries with exponential backoff can take several seconds per attempt. Without isolation, one bad agent slows down every pipeline run. Each agent has its own circuit breaker for this reason.
- Trips open: 3 consecutive failures and the circuit opens. That agent is skipped for the next 5 minutes.
- While open: a safe fallback fills the slot with severity none and action monitor. The pipeline still runs and produces a verdict.
- Recovery: after 5 minutes, one probe call goes through. If it succeeds, the circuit resets. If it fails, it stays open.
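The breaker behavior described above fits in a small in-memory class. This is a sketch under assumptions: the class and method names are hypothetical, the clock is injectable only to make the example testable, and after the cooldown `allow()` keeps returning True until the probe's outcome is recorded.

```python
import time
from typing import Callable, Optional

FALLBACK = {"severity": "none", "action": "monitor"}  # safe default result


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, probes after `cooldown` seconds."""

    def __init__(self, threshold: int = 3, cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may go through (closed, or cooldown elapsed)."""
        if self._opened_at is None:
            return True
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self) -> None:
        """A success resets the failure count and closes the circuit."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """A failure at or past the threshold (re)opens the circuit from now."""
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

A caller would check `allow()` before invoking the agent and substitute `FALLBACK` when it returns False.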
Every run produces one of four verdicts. The verdict is sent to a configurable webhook and the receiver decides what to do with it.
| Severity | Action | Example integrations |
|---|---|---|
| Critical | block | WAF deny list, Redis block set |
| High | throttle | Rate-limit policy update |
| Medium | alert | PagerDuty, Slack notification |
| Low / none | monitor | Logged only, no enforcement |
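The severity-to-action table and the best-effort webhook call can be sketched as follows. The payload shape, the `enforce` and `post_json` names, and the injectable `send` parameter are assumptions for illustration; only standard-library calls are used.

```python
import json
import logging
import urllib.request
from typing import Callable

logger = logging.getLogger("enforcement")

# Mirrors the table above.
ACTIONS = {"critical": "block", "high": "throttle", "medium": "alert",
           "low": "monitor", "none": "monitor"}


def post_json(url: str, payload: dict, timeout: float = 5.0) -> None:
    """POST a JSON payload; raises on network or HTTP errors."""
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        response.read()


def enforce(url: str, app: str, severity: str, reason: str,
            send: Callable[[str, dict], None] = post_json) -> bool:
    """Send the verdict; failures are logged and never raised to the caller."""
    payload = {"app": app, "severity": severity,
               "action": ACTIONS[severity], "reason": reason}
    try:
        send(url, payload)
        return True
    except Exception:
        logger.warning("webhook delivery failed for %s", app, exc_info=True)
        return False
```

Because `enforce` swallows delivery errors, the stored verdict is unaffected by an unreachable receiver.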
Resilience
How Failures Are Handled
LLM providers can fail, time out, or return errors at any point in a pipeline run. The system handles this in three layers. Retries catch transient errors before they count as failures. Timeouts stop the pipeline from waiting on a hung provider. And if an agent fails past all of that, saga compensation steps in so the pipeline still finishes and produces a verdict.
Retry with Backoff
Every LLM call goes through the same retry logic. If a call fails with a transient error, it retries up to 3 times before that failure counts toward the circuit breaker.
- Retryable: RateLimitError, APITimeoutError, InternalServerError, ServiceUnavailableError, APIConnectionError
- Delay: 2s × 2^attempt plus random jitter, so retries space out rather than hammering the provider
- Non-retryable: errors like bad requests fail immediately and don't use the retry budget
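A minimal retry wrapper matching the delay formula above. `TransientError` is a stand-in for the provider SDK exceptions listed, and the zero-based attempt counter, the 0 to 1 second jitter range, and the injectable `sleep` are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for the retryable provider errors listed above."""


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep,
                    retryable: tuple = (TransientError,)) -> T:
    """Call fn, retrying transient errors with exponential backoff and jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # retry budget spent; caller counts this as one failure
            # 2s x 2^attempt plus jitter: roughly 2s, 4s, 8s.
            sleep(base_delay * (2 ** attempt) + random.uniform(0.0, 1.0))
    raise AssertionError("unreachable")
```

Exceptions outside `retryable`, such as a bad request, propagate on the first attempt without sleeping.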
Per-Call Timeout
Every LLM call runs in a daemon thread with a 30-second hard limit. If the provider is hanging, the call doesn't wait forever.
- At 30s: the main thread stops waiting and raises a TimeoutError
- Impact: counts as one failure toward the circuit breaker threshold
- Result: three consecutive timeouts trip the circuit. The agent is then skipped on subsequent runs until the breaker resets.
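The hard limit can be sketched with a daemon thread and a bounded `join`. The function name is hypothetical. Note that Python cannot kill the worker thread, so a hung call is abandoned rather than stopped; marking it as a daemon keeps it from blocking process exit.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout: float = 30.0) -> T:
    """Run fn in a daemon thread; raise TimeoutError if it exceeds the limit."""
    outcome: dict = {}

    def worker() -> None:
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand the error back to the caller's thread
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)  # the calling thread waits at most `timeout` seconds
    if thread.is_alive():
        raise TimeoutError(f"LLM call exceeded {timeout}s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```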
Saga Compensation
The pipeline always produces a verdict. If an agent fails past all retries, a fallback result steps in so the orchestrator has something to work with.
- Per agent: the fallback carries severity none and action monitor, contributing the least-alarming signal to the final decision
- Pipeline guarantee: the orchestrator always runs and writes a verdict to the database, even if all three agents fail
- Outcome tracking: after each run, the previous verdict is checked to see if the anomaly resolved in the next cycle
Provider
LLM Provider Abstraction
Agents never reference a specific LLM provider directly. All LLM calls go through a shared interface that handles retries, timeouts, and cost tracking in one place. Swapping from Anthropic to OpenAI is a config change, not a code change.
Two Call Modes
The interface defines two ways to call a model, depending on what the agent needs back.
- Plain text: used for reason strings — agents pass a prompt and get a short plain-English response
- Structured tool call: used for agent analysis — the model returns validated JSON via function calling, no text parsing needed
- Both modes: go through the same retry logic and 30-second timeout before the circuit breaker sees any failure
Cost Tracked at the Provider Level
Every response carries input tokens, output tokens, and cost in USD. Cost is calculated inside the provider so all agent code above it stays provider-agnostic.
- Stored per result: each agent result and orchestrator result records its own token usage and cost
- Queryable by day: the dashboard breaks cost down by agent name and day so you can see which agent is responsible for what spend
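A sketch of what a provider-agnostic interface with built-in cost reporting might look like. Everything here is hypothetical: the `LLMProvider` protocol, the `LLMResponse` fields, and the `EchoProvider` fake, whose word-based token count and caller-supplied per-token prices are placeholders rather than real tokenization or pricing.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float  # computed inside the provider


class LLMProvider(Protocol):
    def complete(self, prompt: str, max_tokens: int) -> LLMResponse: ...


class EchoProvider:
    """Fake provider for illustration; prices are arguments, not real rates."""

    def __init__(self, usd_per_input_token: float, usd_per_output_token: float) -> None:
        self._in_price = usd_per_input_token
        self._out_price = usd_per_output_token

    def complete(self, prompt: str, max_tokens: int) -> LLMResponse:
        words = prompt.split()      # crude stand-in for tokenization
        output = words[:max_tokens]
        cost = len(words) * self._in_price + len(output) * self._out_price
        return LLMResponse(" ".join(output), len(words), len(output), cost)


def write_reason(provider: LLMProvider, verdict: str) -> LLMResponse:
    """Agent-side code depends only on the protocol, never on a vendor SDK."""
    return provider.complete(f"Explain verdict {verdict}", max_tokens=60)
```

Because `write_reason` only sees the protocol, swapping the concrete provider does not touch agent code.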
Memory
Baselines and Context
The orchestrator reads three layers of historical context before making any decision. All three are loaded before any LLM call so the reason string reflects the full picture.
Global Baseline
One record per app that tracks requests per second, block rate, and bot ratio as EWMA values. Each new run contributes 10% weight and older values decay gradually without dropping to zero.
- Adaptive threshold: apps with a low normal block rate use a tighter spike threshold; apps that regularly run high use a looser one to cut false alarms
- Spike multiplier: each run calculates how far the current block rate sits above the historical average, giving the orchestrator a sense of scale
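The EWMA update and spike multiplier reduce to two short functions. With alpha = 0.1 the update is new = 0.1 × observed + 0.9 × previous. The function names, seeding the baseline with the first observation, and the handling of a zero baseline are assumptions.

```python
from typing import Optional

ALPHA = 0.1  # weight of each new run, from the text above


def ewma_update(previous: Optional[float], observed: float,
                alpha: float = ALPHA) -> float:
    """Blend a new observation into the baseline."""
    if previous is None:
        return observed  # assumption: the first run seeds the baseline
    return alpha * observed + (1.0 - alpha) * previous


def spike_multiplier(current: float, baseline: float) -> float:
    """How many times the current value sits above the historical average."""
    if baseline <= 0.0:
        # Assumption: with no baseline, any traffic counts as an unbounded spike.
        return float("inf") if current > 0.0 else 1.0
    return current / baseline
```

For example, a baseline block rate of 0.10 followed by an observed 0.50 moves the baseline to 0.14, so one extreme run shifts the norm only slightly.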
Time-of-Day Baseline
168 separate buckets per app, one for each hour of the week. Each bucket tracks block rate with its own EWMA so Monday 9am and Sunday 2am are compared against their own historical norms, not a shared weekly average.
- Current bucket: the orchestrator picks the bucket matching the current UTC hour and weekday as the spike reference
- Fallback: if no data exists for a given time slot yet, the global baseline is used instead
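Picking one of the 168 buckets is an index computation on the UTC timestamp. In this sketch the index layout (weekday × 24 + hour, with Monday as 0) and the dict-based bucket store are assumptions, and timestamps are expected to be timezone-aware.

```python
from datetime import datetime, timezone


def bucket_index(ts: datetime) -> int:
    """Map an aware timestamp to an hour-of-week bucket in the range 0..167."""
    utc = ts.astimezone(timezone.utc)
    return utc.weekday() * 24 + utc.hour  # Monday 00:00 is 0, Sunday 23:00 is 167


def reference_block_rate(buckets: dict[int, float], global_baseline: float,
                         ts: datetime) -> float:
    """Use the bucket for this hour and weekday, else the global baseline."""
    return buckets.get(bucket_index(ts), global_baseline)
```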
Recent Run History
The last 3 pipeline results are checked for a directional pattern before the final severity is set.
- Escalating: severity strictly increasing across all 3 runs bumps the current verdict up one level
- Recovering: severity strictly decreasing bumps it down one level
- Stable: mixed results or fewer than 2 prior runs leave the verdict unchanged
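The trend check can be sketched as a comparison over a three-entry window. The names are hypothetical, and how the window is assembled (for example, whether it includes the current run) is left to the caller because the text does not pin it down.

```python
LEVELS = ["none", "low", "medium", "high", "critical"]  # assumed ordering


def apply_trend(verdict: str, window: list[str]) -> str:
    """Bump the verdict one level along a strict three-run trend."""
    if len(window) < 3:
        return verdict  # not enough history: leave unchanged
    first, second, third = (LEVELS.index(s) for s in window[-3:])
    rank = LEVELS.index(verdict)
    if first < second < third:        # escalating
        rank = min(rank + 1, len(LEVELS) - 1)
    elif first > second > third:      # recovering
        rank = max(rank - 1, 0)
    return LEVELS[rank]
```

Clamping at both ends means a critical verdict cannot be bumped higher and a none verdict cannot go lower.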
Evals
How It's Tested
8 scenarios run nightly in isolated SQLite databases. Each one generates synthetic traffic, runs the full agent pipeline, and checks the actual severity and action against the expected values. Accuracy is tracked over time so any regression from a prompt or threshold change surfaces before it reaches production. Evals can also be triggered on demand via /evals.
The 8 Scenarios
- normal_traffic: low block rate, healthy tokens, even path spread. All agents should stay none.
- high_error_rate: 10% 5xx plus 50% 4xx. error_pattern should fire critical.
- high_block_rate: 35% block rate, no 5xx. error_pattern high.
- token_depletion: 80% of requests at 0-2 remaining tokens. token_bucket_health critical.
- path_attack: single path at 83% block rate. top_paths critical.
- flash_crowd: 500 requests, 1.6% block rate, healthy tokens. Should stay none — a legitimate spike, not an attack.
- gradual_ramp: 25% block rate plus 40% near-depletion. Two medium agents; orchestrator escalates to high.
- multi_vector: path attack combined with token depletion. Orchestrator critical.
How Isolation Works
Each scenario runs in its own fresh in-memory database. Eval writes never touch real tables or production data.
- Same logic: the same aggregation code used in production converts synthetic logs to summaries, so evals exercise the real pipeline end to end
- Accuracy tracking: severity accuracy and action accuracy are stored per eval run as percentages
- Visible trend: the accuracy history is queryable at /evals and shown in the dashboard, so any drop from a prompt change is caught before it affects live verdicts
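Per-scenario isolation and accuracy scoring can be sketched with the standard `sqlite3` module. The `run_isolated` and `score` names and the `(severity, action)` tuple shape are assumptions; the real eval harness is not shown in this write-up.

```python
import sqlite3
from typing import Callable

Verdict = tuple[str, str]  # (severity, action)


def run_isolated(run: Callable[[sqlite3.Connection], Verdict]) -> Verdict:
    """Give each scenario a fresh in-memory database that is discarded afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        return run(conn)
    finally:
        conn.close()


def score(expected: dict[str, Verdict], actual: dict[str, Verdict]) -> tuple[float, float]:
    """Return (severity accuracy, action accuracy) as percentages."""
    n = len(expected)
    severity_hits = sum(actual[name][0] == want[0] for name, want in expected.items())
    action_hits = sum(actual[name][1] == want[1] for name, want in expected.items())
    return 100.0 * severity_hits / n, 100.0 * action_hits / n
```

Since every `:memory:` connection is its own database, tables created by one scenario are never visible to the next.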
Observability
What's Visible at Runtime
Every pipeline run generates structured logs and writes its results to the database. Three dashboard endpoints surface what's happening across runs, agents, and apps.
Structured Logs
Every log entry includes a pipeline_run_id set at the start of each run and propagated into worker threads via ContextVar.
- Per entry: timestamp, level, logger name, message, and pipeline_run_id
- Correlation: all log lines from a given run share the same ID, so they group naturally in any log aggregator
- Cross-thread: worker threads inherit the run ID automatically, no manual passing needed
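One way to get the run ID into worker threads and log records is to copy the context at submission time and stamp records with a logging filter. This is a sketch of that approach, not necessarily the project's: `submit_with_context` and `RunIdFilter` are hypothetical names, and the explicit copy is used because a plain `ThreadPoolExecutor` worker does not see the submitting thread's `ContextVar` values on most Python versions.

```python
import contextvars
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

pipeline_run_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "pipeline_run_id", default="-")


def submit_with_context(pool: ThreadPoolExecutor, fn: Callable[..., Any],
                        *args: Any) -> Future:
    """Submit fn so it runs inside a copy of the caller's context."""
    ctx = contextvars.copy_context()  # snapshot taken on the submitting thread
    return pool.submit(ctx.run, fn, *args)


class RunIdFilter(logging.Filter):
    """Attach the current pipeline_run_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.pipeline_run_id = pipeline_run_id.get()
        return True
```

With the filter attached to a handler, a formatter can reference `%(pipeline_run_id)s`, and agents submitted through the wrapper log under the same ID as the main thread.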
Dashboard Endpoints
Three read-only endpoints expose runtime state for monitoring and debugging.
- /dashboard: 7-day LLM cost series broken down by agent name and day
- /agents: recent results per app including severity, action, reason, and token usage per agent
- /evals: daily accuracy trend showing severity accuracy and action accuracy per nightly run
Design Rationale
Where LLMs Fit (and Don't)
There is a deliberate boundary between what the rule engine handles and what the LLM handles. The rule engine sets severity. The LLM writes a reason string. Code is auditable and produces consistent results for the same inputs. LLM calls are not, so they have no path to enforcement decisions.
| Problem | Rules Engine | LLM |
|---|---|---|
| Hard thresholds and severity escalation | Yes, fast and auditable | Not used |
| Trend detection across runs | Yes, pure Python | Not used |
| Plain-English reason for each verdict | No, static templates only | Yes, 40-60 tokens per specialist and 60 tokens for the orchestrator |
| Cross-signal context at scale | Rule explosion risk as signals grow | Context injected into reason prompts only, no effect on severity |
| Deterministic output | Yes | No, non-deterministic across calls |
The LLM has no ability to:
- Override severity: the escalation and trend logic run first. The LLM reads the final verdict but has no way to change it.
- Trigger enforcement: the webhook receives the deterministic verdict. LLM output has no path to enforcement.
- Write free-form state: LLM output is parsed into structured fields. No unstructured text reaches the database.
- Query the source database: that database is read-only for this service. Agents read from a frozen snapshot fetched before any LLM call.
Scale and Resilience
Tradeoffs
Pipeline time scales with the number of apps and LLM latency. Apps run sequentially in the scheduler loop. At larger scale the natural next step is async fan-out across apps with a bounded concurrency limit, keeping the within-app wave order intact and staying within provider rate limits.
| Limitation | Mitigation |
|---|---|
| Apps run sequentially in the scheduler loop | Each LLM call has a 30s timeout. One slow app can delay others but cannot block the loop indefinitely |
| Three separate DB queries per run, one per agent | Simpler isolation and acceptable at current scale |
| Scheduler runs in the same process as the API | A scheduler hang affects the API and vice versa. A separate worker process would isolate these |
| LLM output is non-deterministic | Severity is always set by the rule engine first. LLM output only affects the reason string |
| Malformed LLM output | Missing fields default to none and monitor. A full agent failure falls back to a safe default result |
| Provider timeout or rate limit | Exponential backoff retries up to 3 times, then the circuit breaker trips if failures continue |
| Database unavailable | Connections are validated before each query. The health endpoint returns 503 if either database is unreachable |
Deployment
CI/CD and Docker
Every push goes through a full gate before anything reaches production. Images are tagged by commit SHA so any rollback targets an exact build, not a moving tag.
CI Gate
Every push runs the full suite before CD is allowed to start.
- Secrets: TruffleHog scans the commit history — no credentials reach the repo
- Code quality: mypy for type checking, ruff for linting and formatting
- Tests: pytest covers unit and integration tests
- Security: two Trivy scans — one on the filesystem, one on the final image
CD and Rollback
CD only starts after every CI check passes. The deploy is not considered done until the service is healthy.
- Health polling: /health is checked every 5 seconds for 90 seconds after deploy
- Automatic rollback: if health checks fail, the previous SHA-tagged image restarts automatically
- No drift: SHA tags mean the rollback image is identical to what ran before, not a rebuilt version
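The health-polling step can be sketched in Python, although the actual deploy script may well be shell or CI configuration. The `wait_until_healthy` name, the injectable `check` and `sleep` callables, and treating the first healthy response as success are assumptions.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], interval: float = 5.0,
                       window: float = 90.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() every `interval` seconds for up to `window` seconds."""
    attempts = int(window // interval)  # 18 checks with the defaults
    for _ in range(attempts):
        try:
            if check():
                return True  # assumption: first healthy response ends the wait
        except Exception:
            pass  # a refused connection counts as "not healthy yet"
        sleep(interval)
    return False  # caller would roll back to the previous SHA-tagged image
```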
Image Build
The image is built in two stages so the final artifact is as small and clean as possible.
- Multi-stage build: compiler toolchain and build dependencies stay in the build stage only and never reach the final image
- PID 1: tini runs as PID 1 to handle signals correctly and reap zombie processes
- Non-root: the container runs as a non-root user so a breakout has no host privileges
Explore the Code
The source, eval framework, and CI/CD pipeline are all on GitHub.