Rate Limiter Agents


A multi-agent AI system that monitors rate-limit telemetry for multiple apps on a recurring schedule. Three specialist agents analyze error patterns, token health, and endpoint traffic in a DAG pipeline. An orchestrator combines their outputs into a single verdict (monitor, alert, throttle, or block) with a plain-English reason for each decision.

Python, FastAPI, Multi-Agent, PostgreSQL, APScheduler, Docker, CI/CD

TL;DR

error_pattern always runs first. token_bucket_health and top_paths run in parallel after it. The orchestrator waits for all three before deciding.
Triage runs before any agent. If block rate is low and there are no 5xx errors, the deeper agents are skipped to avoid unnecessary LLM calls.
Severity is pure Python: an escalation function combines the three agent outputs, then a trend check looks at the last 3 runs. The LLM only writes the reason string, 60 tokens max.
Two memory layers per app: a global EWMA baseline (alpha 0.1) and 168 time-of-day/day-of-week buckets. Spike detection uses the bucket for the current hour and weekday.
8 eval scenarios run nightly in an isolated SQLite database. Severity and action accuracy are tracked over time so regressions surface before production.
Each agent has an in-memory circuit breaker. Three consecutive failures trip it for 5 minutes. Failed agents use a safe fallback so the pipeline always completes.

Motivation

Why I Built This

Rate-limit dashboards show you what happened. They don't tell you if it's normal for that app at that hour, or whether it's getting worse. A 40% block rate might be fine for an app with aggressive per-IP limits, or it might be the leading edge of an attack on a shared-bucket API. I built this because that gap is where most incidents start: you have the signal, but not the context to act on it. The core design question was where deterministic rules end and where LLM reasoning should begin. The answer here is that severity is always set by code and the LLM handles only the explanation. In 8 nightly eval scenarios spanning normal traffic through multi-vector attacks, severity accuracy and action accuracy both hold at 100%.

The stack choices follow the same principle: simple where simple works, explicit where it matters. FastAPI because the dashboard endpoints benefit from async handlers and Pydantic validation catches malformed payloads at system boundaries. APScheduler rather than Celery because this scale doesn't require a broker; an in-process scheduler simplifies deployment and removes a runtime dependency. PostgreSQL for the agent database because EWMA baseline state must survive restarts and support concurrent reads from the dashboard. SQLite is enough for isolated evals, but production baselines need durable storage shared between the scheduler and the dashboard.

Architecture

How It's Wired Together

The data source queries the rate-limiter database once at the start of each run. All summaries go into a PipelineState object. Agents read from that frozen snapshot and never query the database themselves, so there are no consistency issues between reads within the same run.

The sequence diagram below shows one complete pipeline run for a single app, covering the order of database reads, LLM calls, and writes from start to finish.

Sequence diagram: Scheduler triggers Agent Pipeline on a recurring schedule. Data source fetches error summary, token summary, and paths summary from Rate Limiter DB and builds PipelineState v0. Triage selects agents. Wave 1: error_pattern runs inline. Wave 2: token_bucket_health and top_paths run in parallel threads. Each agent calls LLM for reason if anomaly detected, writes AgentResult. Orchestrator reads BaselineMemory, TimeBaseline, and recent OrchestratorResults from Agent DB, calls LLM for reason string, writes OrchestratorResult, updates baselines. Enforcement webhook receives the verdict.

Startup and data layer details

Two databases with different roles. The rate-limiter database is read-only from this service. No writes, no migrations against it. The agent database stores results, baselines, and eval outcomes. Alembic migrations run against the agent database on startup before the scheduler starts. If migrations fail, startup aborts.

Both database engines validate connections before each query and reconnect transparently if a database has restarted. The /health/ready endpoint runs a live check against both databases and returns 503 if either is unreachable.

Pipeline

How a Run Works

One complete pipeline run goes through four stages. Telemetry is fetched once into a frozen snapshot, agents analyze that snapshot independently, the orchestrator combines their results into a single severity verdict, and the verdict is sent to the enforcement webhook.

1. Triage

Before any agent runs, the orchestrator checks the raw telemetry to decide which agents are needed for this run.

Always runs: error_pattern runs on every pipeline execution regardless of traffic state
Conditional: token_bucket_health and top_paths activate only if block rate is above 10% or any 5xx errors are present
Clean traffic: one agent runs, one LLM call is made
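
The triage rule above can be sketched as a small pure function. This is a minimal illustration, not the project's code: ErrorSummary, its field names, and select_agents are hypothetical; only the 10% block-rate cutoff and the 5xx condition come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSummary:
    """Hypothetical slice of the frozen telemetry snapshot."""
    block_rate: float  # fraction of requests blocked, 0.0 to 1.0
    count_5xx: int     # number of 5xx responses in the window


BLOCK_RATE_TRIAGE_THRESHOLD = 0.10  # "above 10%" from the triage rule


def select_agents(summary: ErrorSummary) -> list:
    """Return the agents to run; error_pattern is always included."""
    agents = ["error_pattern"]
    if summary.block_rate > BLOCK_RATE_TRIAGE_THRESHOLD or summary.count_5xx > 0:
        agents += ["token_bucket_health", "top_paths"]
    return agents
```

On clean traffic this returns a single agent, which is what keeps the run to one LLM call.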

2. DAG Execution

The agent dependency graph resolves into two sequential waves based on which agents triage selected.

Wave 1: error_pattern runs on the main thread with no thread overhead
Wave 2: token_bucket_health and top_paths run simultaneously in separate threads, each with its own database session
Log correlation: all log lines for a given run share the same pipeline run ID, including those from worker threads
State update: the main thread collects all results after both waves settle and updates state in a fixed order to keep versioning deterministic
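
A rough sketch of the two-wave execution, assuming each agent is a plain callable that takes the frozen snapshot and returns a result dict. run_waves and the dict-based snapshot are illustrative names, and the per-thread database sessions mentioned above are left out to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

AgentFn = Callable[[dict], dict]  # takes the frozen snapshot, returns a result

ORDER = ["error_pattern", "token_bucket_health", "top_paths"]


def run_waves(snapshot: dict, agents: Dict[str, AgentFn],
              selected: List[str]) -> List[Tuple[str, dict]]:
    """Run wave 1 inline and wave 2 in threads; return results in a fixed order."""
    results: Dict[str, dict] = {}
    # Wave 1: error_pattern runs on the calling thread.
    results["error_pattern"] = agents["error_pattern"](snapshot)
    # Wave 2: whichever of the remaining agents triage selected.
    wave2 = [name for name in ORDER[1:] if name in selected]
    if wave2:
        with ThreadPoolExecutor(max_workers=len(wave2)) as pool:
            futures = {name: pool.submit(agents[name], snapshot) for name in wave2}
            for name, future in futures.items():
                results[name] = future.result()
    # Collecting in a fixed order keeps state versioning deterministic,
    # regardless of which thread finished first.
    return [(name, results[name]) for name in ORDER if name in results]
```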

3. Decision

The orchestrator combines the three agent results and determines a final severity verdict using pure Python.

Escalation: a fixed rule set combines the three severities — any critical input leads to a critical verdict, two high inputs lead to critical, two medium inputs lead to high
Trend check: the last 3 runs are checked — a consistently escalating pattern bumps severity up one level, a recovering pattern bumps it down
Reason: the LLM is called once to write a 60-token plain-English reason. If that call fails, a structured fallback is used instead
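
The escalation rules can be written as a few lines of pure Python. This sketch covers only the three rules stated above; the final fallthrough (the most severe single input wins) is an assumption, since the text does not say what happens when no rule matches.

```python
LEVELS = ["none", "low", "medium", "high", "critical"]


def escalate(severities: list) -> str:
    """Combine agent severities into one verdict using the fixed rule set."""
    if "critical" in severities or severities.count("high") >= 2:
        return "critical"
    if severities.count("medium") >= 2:
        return "high"
    # Assumption: with no rule matched, take the highest single severity.
    return max(severities, key=LEVELS.index, default="none")
```

For example, escalate(["medium", "medium", "none"]) returns "high", which matches the gradual_ramp eval scenario.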

4. Enforcement

The verdict is posted to a configurable webhook. The call is best-effort — failures are logged but do not affect the stored verdict.

block: example integrations include WAF deny list or Redis block set
throttle: rate-limit policy update
alert: PagerDuty or Slack notification
monitor: logged only, no action taken
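
A best-effort webhook call might look like the sketch below, using only the standard library. The function names, payload shape, and 5-second timeout are assumptions; the point is that delivery failures are logged and swallowed so the stored verdict is never affected.

```python
import json
import logging
import urllib.request
from typing import Callable

log = logging.getLogger("enforcement")


def http_post(url: str, payload: dict, timeout: float = 5.0) -> None:
    """POST JSON with the standard library; raises on network or HTTP errors."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def send_verdict(url: str, verdict: dict,
                 post: Callable[[str, dict], None] = http_post) -> bool:
    """Best-effort delivery: log failures, never raise to the caller."""
    try:
        post(url, verdict)
        return True
    except Exception:
        log.exception("webhook delivery failed; stored verdict is unchanged")
        return False
```

Passing post as a parameter makes the delivery path easy to exercise in tests without a network.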

Agents

Specialist Agents

Three specialist agents each watch a different signal: error patterns, token health, and endpoint traffic. Each one classifies severity using deterministic rules first, then calls the LLM only when an anomaly is detected — to write a short plain-English reason, nothing more.

① Error Pattern Agent

Classifies severity based on block rate, error rate, and 5xx presence over a configurable time window. LLM is called only when an anomaly is detected, to write a 40-token reason.

Always runs — the only agent that bypasses triage, even on clean traffic.

Per-IP mode: if one IP accounts for most blocks, severity is eased down — the limiter is working as intended

Severity thresholds

Severity	Threshold
Critical	error_rate > 50% or any 5xx
High	error_rate > 30%
Medium	error_rate > 20%
Low	error_rate > 5%
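
The threshold table translates directly into a table-driven classifier. This is a sketch with hypothetical names; rates are fractions rather than percentages, and the per-IP easing described above is not modeled.

```python
# (level, strict lower bound on error_rate), checked from most to least severe
ERROR_THRESHOLDS = [("critical", 0.50), ("high", 0.30), ("medium", 0.20), ("low", 0.05)]


def classify_error_pattern(error_rate: float, count_5xx: int) -> str:
    """Map error rate and 5xx presence to a severity level."""
    if count_5xx > 0:
        return "critical"  # any 5xx is critical on its own
    for level, cutoff in ERROR_THRESHOLDS:
        if error_rate > cutoff:
            return level
    return "none"
```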

② Token Bucket Health Agent

Checks what percentage of requests have near-zero remaining tokens (10% or less of observed capacity) over a configurable time window.

Per-IP mode: many unique IPs depleting at the same time suggests a coordinated surge rather than a single bad actor

Severity thresholds

Severity	Threshold
Critical	> 70% near-depleted, or 5+ at zero
High	> 50% near-depleted
Medium	> 30% near-depleted
Low	> 10% near-depleted
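
A sketch of the near-depletion check under two assumptions: the input is a list of remaining-token readings, one per request, and "5+ at zero" means five or more requests observed with zero tokens left. The function names are hypothetical.

```python
def near_depleted_fraction(remaining: list, capacity: float) -> float:
    """Share of requests whose remaining tokens are <= 10% of capacity."""
    if not remaining or capacity <= 0:
        return 0.0
    cutoff = 0.10 * capacity
    return sum(1 for r in remaining if r <= cutoff) / len(remaining)


def classify_token_health(remaining: list, capacity: float) -> str:
    """Apply the severity table to the near-depleted share."""
    fraction = near_depleted_fraction(remaining, capacity)
    at_zero = sum(1 for r in remaining if r == 0)  # assumed reading of "5+ at zero"
    if fraction > 0.70 or at_zero >= 5:
        return "critical"
    if fraction > 0.50:
        return "high"
    if fraction > 0.30:
        return "medium"
    if fraction > 0.10:
        return "low"
    return "none"
```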

③ Top Paths Agent

Identifies endpoints with disproportionate block rates or traffic concentration over a configurable time window. Flags an anomaly when any path's block_rate exceeds 50%, or when a single path accounts for more than 80% of total traffic.

Per-IP mode: blocks concentrated on 1-2 IPs at a high-block path are expected behavior, not a signal

Severity thresholds

Severity	Threshold
Critical	any path block_rate > 80%
High	block_rate > 60%
Medium	block_rate > 40%
Low	block_rate > 20%
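
The anomaly flag and the severity table can be combined in one pass over per-path counts. PathStats and top_paths_signal are hypothetical names, and per-IP mode is not modeled; the 50% and 80% anomaly cutoffs and the severity thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PathStats:
    path: str
    requests: int
    blocked: int

    @property
    def block_rate(self) -> float:
        return self.blocked / self.requests if self.requests else 0.0


def top_paths_signal(paths: List[PathStats]) -> Tuple[bool, str]:
    """Return (anomaly_flag, severity) for a set of per-path counters."""
    total = sum(p.requests for p in paths)
    worst = max((p.block_rate for p in paths), default=0.0)
    top_share = max((p.requests / total for p in paths), default=0.0) if total else 0.0
    anomaly = worst > 0.50 or top_share > 0.80
    for level, cutoff in (("critical", 0.80), ("high", 0.60),
                          ("medium", 0.40), ("low", 0.20)):
        if worst > cutoff:
            return anomaly, level
    return anomaly, "none"
```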

Orchestrator

Decision Engine

Each agent works on its own signal and produces its own verdict. The orchestrator is what connects them. It reads all three results, applies a fixed rule set to determine severity, and sends a single enforcement action downstream.

★ Core

Severity is set by code, not the LLM

Block and throttle decisions need to be consistent and auditable. LLM outputs are not. The same inputs can produce different results across calls, so severity is always determined by code.

Escalation rule: any critical agent leads to a critical verdict; two high agents lead to critical; two medium agents lead to high. No model is involved.
Trend check: if the last 3 runs show severity strictly increasing, it bumps up one level. Strictly decreasing, it bumps down.
LLM role: called once after severity is already set, to write a short plain-English reason. It reads the verdict but has no ability to change it.

Most runs never activate all three agents

Clean traffic costs one LLM call, not three. Before any agent runs, the orchestrator reads the raw telemetry. If block rate is below 10% and no 5xx errors are present, only error_pattern runs and the other two are skipped. On elevated or degraded traffic, all three activate.

Execution order is a DAG, not a hardcoded sequence

error_pattern always runs first because token_bucket_health and top_paths both depend on its signal. That dependency structure is a DAG. Execution order comes from the graph, not a fixed sequence. Once error_pattern finishes, the other two run in parallel threads. Adding a new agent means declaring its dependencies once, not touching the pipeline itself.
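
One way to derive the waves from declared dependencies is the standard library's graphlib. This is a sketch of the idea, not necessarily how the project resolves its graph; DEPENDENCIES and plan_waves are illustrative names.

```python
from graphlib import TopologicalSorter

# Each agent maps to the set of agents it depends on.
DEPENDENCIES = {
    "error_pattern": set(),
    "token_bucket_health": {"error_pattern"},
    "top_paths": {"error_pattern"},
}


def plan_waves(deps: dict) -> list:
    """Group agents into waves; every agent in a wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on a dependency cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a stable order
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

With the graph above this yields [["error_pattern"], ["token_bucket_health", "top_paths"]]. A new agent that depends on top_paths would land in a third wave without any change to plan_waves.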

Circuit Breaker

One failing agent never stalls the pipeline

LLM providers fail. Retries with exponential backoff can take several seconds per attempt. Without isolation, one bad agent slows down every pipeline run. Each agent has its own circuit breaker for this reason.

Trips open: 3 consecutive failures and the circuit opens. That agent is skipped for the next 5 minutes.
While open: a safe fallback fills the slot with severity none and action monitor. The pipeline still runs and produces a verdict.
Recovery: after 5 minutes, one probe call goes through. If it succeeds, the circuit resets. If it fails, it stays open.
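
A minimal in-memory breaker with the numbers above (3 failures, 5-minute cooldown) could look like this. The class and method names are hypothetical, and the sketch assumes one caller per agent, so the single-probe behavior is not guarded against concurrent calls.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, allows a probe after `cooldown`."""

    def __init__(self, threshold: int = 3, cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True when closed, or when the cooldown has elapsed (probe call)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        """A successful call (including a probe) resets the breaker."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open, or re-open after a failed probe."""
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

The injectable clock is only there so the cooldown can be tested without waiting five minutes.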

Every run produces one of four verdicts. The verdict is sent to a configurable webhook and the receiver decides what to do with it.

Final verdict mapping
Severity	Action	Example integrations
Critical	block	WAF deny list, Redis block set
High	throttle	Rate-limit policy update
Medium	alert	PagerDuty, Slack notification
Low / none	monitor	Logged only, no enforcement

Resilience

How Failures Are Handled

LLM providers can fail, time out, or return errors at any point in a pipeline run. The system handles this in three layers. Retries catch transient errors before they count as failures. Timeouts stop the pipeline from waiting on a hung provider. And if an agent fails past all of that, saga compensation steps in so the pipeline still finishes and produces a verdict.

Retry with Backoff

Every LLM call goes through the same retry logic. If a call fails with a transient error, it retries up to 3 times before that failure counts toward the circuit breaker.

Retryable: RateLimitError, APITimeoutError, InternalServerError, ServiceUnavailableError, APIConnectionError
Delay: 2s × 2^attempt plus random jitter, so retries space out rather than hammering the provider
Non-retryable: errors like bad requests fail immediately and don't use the retry budget
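
The retry policy can be sketched as follows. TransientError is a stand-in for the retryable provider errors listed above, "up to 3 times" is read as one initial attempt plus three retries, and the attempt counter is assumed to start at 0 (delays of roughly 2s, 4s, 8s); the jitter range of 0 to 1 second is also an assumption.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for rate-limit, timeout, and 5xx errors from a provider SDK."""


def call_with_retry(fn: Callable[[], T], max_retries: int = 3,
                    base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry transient failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_retries:
                raise  # retry budget spent; the caller counts this as a failure
            sleep(base_delay * (2 ** attempt) + random.uniform(0.0, 1.0))
    raise AssertionError("unreachable")
```

Any other exception type propagates on the first attempt, which is how non-retryable errors skip the retry budget.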

Per-Call Timeout

Every LLM call runs in a daemon thread with a 30-second hard limit. If the provider is hanging, the call doesn't wait forever.

At 30s: the main thread stops waiting and raises a TimeoutError
Impact: counts as one failure toward the circuit breaker threshold
Result: three consecutive timeouts trip the circuit. The agent is then skipped until the 5-minute breaker window ends.
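
A hard per-call limit with a daemon thread can be sketched like this. call_with_timeout is a hypothetical name. Note that the worker thread is not killed on timeout; the caller simply stops waiting, and the daemon flag keeps a hung call from blocking interpreter shutdown.

```python
import threading
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_timeout(fn: Callable[[], T], timeout: float = 30.0) -> T:
    """Run fn in a daemon thread and stop waiting after `timeout` seconds."""
    values: List[T] = []
    errors: List[BaseException] = []

    def target() -> None:
        try:
            values.append(fn())
        except BaseException as exc:  # hand the error back to the caller
            errors.append(exc)

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        raise TimeoutError(f"call exceeded {timeout}s")
    if errors:
        raise errors[0]
    return values[0]
```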

★ Core

Saga Compensation

The pipeline always produces a verdict. If an agent fails past all retries, a fallback result steps in so the orchestrator has something to work with.

Per agent: the fallback carries severity none and action monitor, contributing the least-alarming signal to the final decision
Pipeline guarantee: the orchestrator always runs and writes a verdict to the database, even if all three agents fail
Outcome tracking: after each run, the previous verdict is checked to see if the anomaly resolved in the next cycle
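
The per-agent compensation step is small enough to show in full. This sketch uses hypothetical names and plain dicts for results; the fallback values (severity none, action monitor) are the ones stated above.

```python
from typing import Callable, Dict, List

AGENTS = ("error_pattern", "token_bucket_health", "top_paths")


def fallback_result(agent: str) -> dict:
    """Least-alarming result used when an agent fails past all retries."""
    return {"agent": agent, "severity": "none", "action": "monitor", "fallback": True}


def run_with_compensation(agent: str, fn: Callable[[], dict]) -> dict:
    """Run one agent; substitute the fallback instead of propagating the error."""
    try:
        return fn()
    except Exception:
        return fallback_result(agent)


def collect(fns: Dict[str, Callable[[], dict]]) -> List[dict]:
    """Always returns one result per selected agent, even if every agent fails."""
    return [run_with_compensation(name, fns[name]) for name in AGENTS if name in fns]
```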

Provider

LLM Provider Abstraction

Agents never reference a specific LLM provider directly. All LLM calls go through a shared interface that handles retries, timeouts, and cost tracking in one place. Swapping from Anthropic to OpenAI is a config change, not a code change.

Two Call Modes

The interface defines two ways to call a model, depending on what the agent needs back.

Plain text: used for reason strings — agents pass a prompt and get a short plain-English response
Structured tool call: used for agent analysis — the model returns validated JSON via function calling, no text parsing needed
Both modes: go through the same retry logic and 30-second timeout before the circuit breaker sees any failure

Cost Tracked at the Provider Level

Every response carries input tokens, output tokens, and cost in USD. Cost is calculated inside the provider so all agent code above it stays provider-agnostic.

Stored per result: each agent result and orchestrator result records its own token usage and cost
Queryable by day: the dashboard breaks cost down by agent name and day so you can see which agent is responsible for what spend
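
The shape of such an interface can be sketched with typing.Protocol. Everything here is illustrative: the method names, the LLMResponse fields, and FakeProvider with its made-up prices and word-count "tokens" are assumptions, not the project's actual API or any vendor's pricing.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LLMResponse:
    text: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


class LLMProvider(Protocol):
    def complete_text(self, prompt: str, max_tokens: int) -> LLMResponse: ...
    def complete_tool(self, prompt: str, schema: dict) -> dict: ...


class FakeProvider:
    """Offline stand-in; counts words as tokens and uses hypothetical prices."""
    PRICE_IN = 1e-6   # hypothetical USD per input token
    PRICE_OUT = 2e-6  # hypothetical USD per output token

    def complete_text(self, prompt: str, max_tokens: int) -> LLMResponse:
        words = "block rate is above baseline".split()[:max_tokens]
        n_in, n_out = len(prompt.split()), len(words)
        cost = n_in * self.PRICE_IN + n_out * self.PRICE_OUT
        return LLMResponse(" ".join(words), n_in, n_out, cost)

    def complete_tool(self, prompt: str, schema: dict) -> dict:
        return {"severity": "none"}


def write_reason(provider: LLMProvider, verdict: str) -> LLMResponse:
    """Agent-side code sees only the interface, never a vendor SDK."""
    return provider.complete_text(f"Explain verdict {verdict} briefly.", max_tokens=60)
```

Because cost is computed inside the provider, write_reason stays the same when the implementation behind LLMProvider changes.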

Memory

Baselines and Context

The orchestrator reads three layers of historical context before making any decision. All three are loaded before any LLM call so the reason string reflects the full picture.

★ Core

Global Baseline

One record per app that tracks requests per second, block rate, and bot ratio as EWMA values. Each new run contributes 10% weight and older values decay gradually without dropping to zero.

Adaptive threshold: apps with a low normal block rate use a tighter spike threshold; apps that regularly run high use a looser one to cut false alarms
Spike multiplier: each run calculates how far the current block rate sits above the historical average, giving the orchestrator a sense of scale
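
The EWMA update with alpha 0.1 and the spike multiplier reduce to two short functions. The names and the small floor that guards against division by zero are assumptions; the adaptive threshold logic is not shown because the text does not give its exact form.

```python
from typing import Optional

ALPHA = 0.1  # each new run contributes 10% weight


def ewma_update(previous: Optional[float], observed: float,
                alpha: float = ALPHA) -> float:
    """Blend the new observation into the running baseline."""
    if previous is None:
        return observed  # first observation seeds the baseline
    return alpha * observed + (1.0 - alpha) * previous


def spike_multiplier(current: float, baseline: float, floor: float = 1e-6) -> float:
    """How many times the current value sits above the baseline."""
    return current / max(baseline, floor)
```

For example, a baseline block rate of 0.10 followed by an observed 0.40 moves the baseline to 0.13, while the spike multiplier for that run is about 4.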

★ Core

Time-of-Day Baseline

168 separate buckets per app, one for each hour of the week. Each bucket tracks block rate with its own EWMA so Monday 9am and Sunday 2am are compared against their own historical norms, not a shared weekly average.

Current bucket: the orchestrator picks the bucket matching the current UTC hour and weekday as the spike reference
Fallback: if no data exists for a given time slot yet, the global baseline is used instead
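
Picking one of the 168 buckets can be sketched as weekday times 24 plus hour, in UTC. The index layout (Monday is 0, following datetime.weekday()) and the dict-based bucket store are assumptions.

```python
from datetime import datetime, timezone
from typing import Dict


def bucket_index(ts: datetime) -> int:
    """Map a timezone-aware timestamp to one of 168 hour-of-week buckets."""
    utc = ts.astimezone(timezone.utc)
    return utc.weekday() * 24 + utc.hour  # Monday 00:00 UTC is bucket 0


def spike_reference(buckets: Dict[int, float], global_baseline: float,
                    ts: datetime) -> float:
    """Use the bucket for this hour and weekday, else the global baseline."""
    return buckets.get(bucket_index(ts), global_baseline)
```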

Recent Run History

The last 3 pipeline results are checked for a directional pattern before the final severity is set.

Escalating: severity strictly increasing across all 3 runs bumps the current verdict up one level
Recovering: severity strictly decreasing bumps it down one level
Stable: mixed results or fewer than 2 prior runs leave the verdict unchanged
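
The trend check can be sketched as a strict comparison over three stored severities. This version assumes the window is the three most recent stored results (oldest first) and leaves the verdict unchanged when fewer than three exist; the project may count the window differently.

```python
LEVELS = ["none", "low", "medium", "high", "critical"]


def apply_trend(current: str, recent: list) -> str:
    """Bump the verdict one level up or down on a strict 3-run trend."""
    if len(recent) < 3:
        return current  # not enough history to call a trend
    first, second, third = (LEVELS.index(s) for s in recent[-3:])
    index = LEVELS.index(current)
    if first < second < third:      # escalating
        index = min(index + 1, len(LEVELS) - 1)
    elif first > second > third:    # recovering
        index = max(index - 1, 0)
    return LEVELS[index]
```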

Evals

How It's Tested

8 scenarios run nightly in isolated SQLite databases. Each one generates synthetic traffic, runs the full agent pipeline, and checks the actual severity and action against the expected values. Accuracy is tracked over time so any regression from a prompt or threshold change surfaces before it reaches production. Evals can also be triggered on demand via /evals.

The 8 Scenarios

normal_traffic: low block rate, healthy tokens, even path spread. All agents should stay none.
high_error_rate: 10% 5xx plus 50% 4xx. error_pattern should fire critical.
high_block_rate: 35% block rate, no 5xx. error_pattern high.
token_depletion: 80% of requests at 0-2 remaining tokens. token_bucket_health critical.
path_attack: single path at 83% block rate. top_paths critical.
flash_crowd: 500 requests, 1.6% block rate, healthy tokens. Should stay none — a legitimate spike, not an attack.
gradual_ramp: 25% block rate plus 40% near-depletion. Two medium agents; orchestrator escalates to high.
multi_vector: path attack combined with token depletion. Orchestrator critical.


How Isolation Works

Each scenario runs in its own fresh in-memory database. Eval writes never touch real tables or production data.

Same logic: the same aggregation code used in production converts synthetic logs to summaries, so evals exercise the real pipeline end to end
Accuracy tracking: severity accuracy and action accuracy are stored per eval run as percentages
Visible trend: the accuracy history is queryable at /evals and shown in the dashboard, so any drop from a prompt change is caught before it affects live verdicts
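
Both ideas, a fresh in-memory database per scenario and accuracy as a percentage, fit in a short sketch. The request_logs schema, EvalOutcome, and the function names are hypothetical.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EvalOutcome:
    scenario: str
    expected_severity: str
    actual_severity: str
    expected_action: str
    actual_action: str


def fresh_scenario_db() -> sqlite3.Connection:
    """Each call returns a new, empty in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE request_logs (path TEXT, status INTEGER)")  # hypothetical schema
    return conn


def accuracy(outcomes: List[EvalOutcome]) -> Tuple[float, float]:
    """Return (severity accuracy %, action accuracy %) for one eval run."""
    if not outcomes:
        return 0.0, 0.0
    total = len(outcomes)
    severity_hits = sum(o.expected_severity == o.actual_severity for o in outcomes)
    action_hits = sum(o.expected_action == o.actual_action for o in outcomes)
    return 100.0 * severity_hits / total, 100.0 * action_hits / total
```

Two connections opened with ":memory:" never share tables, so rows written for one scenario cannot leak into another.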

Observability

What's Visible at Runtime

Every pipeline run generates structured logs and writes its results to the database. Three dashboard endpoints surface what's happening across runs, agents, and apps.

Structured Logs

Every log entry includes a pipeline_run_id set at the start of each run and propagated into worker threads via ContextVar.

Per entry: timestamp, level, logger name, message, and pipeline_run_id
Correlation: all log lines from a given run share the same ID, so they group naturally in any log aggregator
Cross-thread: worker threads inherit the run ID automatically, no manual passing needed
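
In standard Python, one way to carry a ContextVar into pool threads is to copy the caller's context at submit time, wrapped once so agent code never passes the ID by hand. The sketch below shows that approach along with a logging filter that stamps each record; the helper names are hypothetical and the project may propagate the ID differently.

```python
import contextvars
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable

pipeline_run_id: contextvars.ContextVar = contextvars.ContextVar(
    "pipeline_run_id", default="-"
)


class RunIdFilter(logging.Filter):
    """Attach the current run ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.pipeline_run_id = pipeline_run_id.get()
        return True


def submit_with_context(pool: ThreadPoolExecutor,
                        fn: Callable[..., Any], *args: Any) -> Future:
    """Submit fn so it runs inside a copy of the caller's context."""
    context = contextvars.copy_context()
    return pool.submit(context.run, fn, *args)
```

A formatter can then include %(pipeline_run_id)s, and any log line emitted from a worker started through submit_with_context carries the same ID as the main thread.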

Dashboard Endpoints

Three read-only endpoints expose runtime state for monitoring and debugging.

/dashboard: 7-day LLM cost series broken down by agent name and day
/agents: recent results per app including severity, action, reason, and token usage per agent
/evals: daily accuracy trend showing severity accuracy and action accuracy per nightly run

Design Rationale

Where LLMs Fit (and Don't)

There is a deliberate boundary between what the rule engine handles and what the LLM handles. The rule engine sets severity. The LLM writes a reason string. Code is auditable and produces consistent results for the same inputs. LLM calls are not, so they have no path to enforcement decisions.

What each layer handles
Problem	Rules Engine	LLM
Hard thresholds and severity escalation	Yes, fast and auditable	Not used
Trend detection across runs	Yes, pure Python	Not used
Plain-English reason for each verdict	No, static templates only	Yes, 40-60 tokens per specialist and 60 tokens for the orchestrator
Cross-signal context at scale	Rule explosion risk as signals grow	Context injected into reason prompts only, no effect on severity
Deterministic output	Yes	No, non-deterministic across calls

The LLM has no ability to:

Override severity: the escalation and trend logic run first. The LLM reads the final verdict but has no way to change it.
Trigger enforcement: the webhook receives the deterministic verdict. LLM output has no path to enforcement.
Write free-form state: LLM output is parsed into structured fields. No unstructured text reaches the database.
Query the source database: that database is read-only for this service. Agents read from a frozen snapshot fetched before any LLM call.

Scale and Resilience

Tradeoffs

Pipeline time scales with the number of apps and LLM latency. Apps run sequentially in the scheduler loop. At larger scale the natural next step is async fan-out across apps with a bounded concurrency limit, keeping the within-app wave order intact and staying within provider rate limits.

2-5 s per pipeline run (single app)

300-800 ms avg LLM call latency

~$0.004-0.012 per full pipeline run

O(1) baseline memory per app

~$8-15/day at a 15-min interval across 10-20 apps

Limitation	Mitigation
Apps run sequentially in the scheduler loop	Each LLM call has a 30s timeout. One slow app can delay others but cannot block the loop indefinitely
Three separate summary queries per run (error, token, paths), one per agent input	Simpler isolation and acceptable at current scale
Scheduler runs in the same process as the API	A scheduler hang affects the API and vice versa. A separate worker process would isolate these
LLM output is non-deterministic	Severity is always set by the rule engine first. LLM output only affects the reason string
Malformed LLM output	Missing fields default to none and monitor. A full agent failure falls back to a safe default result
Provider timeout or rate limit	Exponential backoff retries up to 3 times, then the circuit breaker trips if failures continue
Database unavailable	Connections are validated before each query. The health endpoint returns 503 if either database is unreachable

Deployment

CI/CD and Docker

Every push goes through a full gate before anything reaches production. Images are tagged by commit SHA so any rollback targets an exact build, not a moving tag.

CI Gate

Every push runs the full suite before CD is allowed to start.

Secrets: TruffleHog scans the commit history for leaked credentials
Code quality: mypy for type checking, ruff for linting and formatting
Tests: pytest covers unit and integration tests
Security: two Trivy scans — one on the filesystem, one on the final image

CD and Rollback

CD only starts after every CI check passes. The deploy is not considered done until the service is healthy.

Health polling: /health is checked every 5 seconds for 90 seconds after deploy
Automatic rollback: if health checks fail, the previous SHA-tagged image restarts automatically
No drift: SHA tags mean the rollback image is identical to what ran before, not a rebuilt version
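
The poll-then-rollback logic is sketched here in Python for consistency with the rest of the page; the real CD step is more likely a shell or workflow script. The sketch assumes the first successful check counts as healthy, and deploy, start, and check are hypothetical callables.

```python
import time
from typing import Callable


def wait_until_healthy(check: Callable[[], bool], interval: float = 5.0,
                       window: float = 90.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` every `interval` seconds for up to `window` seconds."""
    for _ in range(int(window // interval)):
        if check():
            return True  # assumption: first healthy response ends the wait
        sleep(interval)
    return False


def deploy(new_sha: str, previous_sha: str, start: Callable[[str], None],
           check: Callable[[], bool],
           sleep: Callable[[float], None] = time.sleep) -> str:
    """Start the new image; restart the previous SHA if it never turns healthy."""
    start(new_sha)
    if wait_until_healthy(check, sleep=sleep):
        return new_sha
    start(previous_sha)  # rollback targets the exact previous build
    return previous_sha
```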

Docker

Image Build

The image is built in two stages so the final artifact is as small and clean as possible.

Multi-stage build: compiler toolchain and build dependencies stay in the build stage only and never reach the final image
PID 1: tini runs as PID 1 to handle signals correctly and reap zombie processes
Non-root: the container runs as a non-root user, so a compromised process does not have root privileges inside the container

Explore the Code

Source, eval framework, and CI/CD pipeline all on GitHub.
