⏱️ Rate Limiter

A distributed rate limiting service built with Spring Boot and Redis. Instead of embedding rate limiting logic into every API or gateway, this runs as a centralized service that any upstream system can call. The response includes standard rate limit metadata that callers can forward directly to clients.

Most API gateways support rate limiting, but those limits are often tied to a single gateway instance or vendor platform. This service centralizes enforcement so multiple gateways and services can share the same policies.

Spring Boot · Spring AI · MCP · Redis · Token Bucket · Fail-open / Fail-closed · Bucket4j · Lettuce · Lua Script · PostgreSQL · SSE · Docker · Resilience4j · CI/CD

Overview

What It Does

You register an app, assign it a rate limit plan, and this service does the rest. When a request comes in, the caller sends a service identifier, client IP, and request path; the service checks the token bucket in Redis and responds with allowed, remaining tokens, and reset-after. Callers can forward those values directly as RateLimit-* headers.

Every decision (allowed or blocked) gets written to PostgreSQL with the full request context: IP, path, HTTP method, device type, browser, OS, user agent classification, and basic bot detection signals. That audit log makes it straightforward to investigate abuse or tune limits after the fact.

Algorithm

Rate Limiting Strategy

Token Bucket via Bucket4j, with all state stored in Redis. Because the bucket lives in Redis rather than in-process, the limit holds the same whether you're running one instance or ten.

🪣

Token Bucket (Bucket4j + Redis)

Each app gets a bucket with a capacity ceiling and a steady refill rate. Incoming requests spend tokens; once the bucket is empty, requests are blocked until tokens refill.

Handles short bursts: callers can spend up to the full capacity before hitting the limit
Continuous refill: tokens are restored steadily over time instead of resetting in fixed intervals
Bucket state lives in Redis, consistent across every running instance
Read-modify-write is a single atomic Lua script, so no race conditions under concurrency
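
The burst-then-refill behavior listed above can be sketched in a few lines. This is an in-memory illustration only, not Bucket4j's implementation and not Redis-backed: the class name `SimpleTokenBucket`, its fields, the explicit `nowNanos` parameter, and the choice to start the bucket full are all assumptions made for the example.

```java
/** Illustrative in-memory token bucket; not Bucket4j and not Redis-backed. */
final class SimpleTokenBucket {
    private final long capacity;         // maximum tokens the bucket can hold
    private final double tokensPerNano;  // continuous refill rate
    private double tokens;               // current (possibly fractional) token count
    private long lastRefillNanos;        // timestamp of the last refill calculation

    SimpleTokenBucket(long capacity, long refillTokens, long refillPeriodNanos, long nowNanos) {
        this.capacity = capacity;
        this.tokensPerNano = (double) refillTokens / refillPeriodNanos;
        this.tokens = capacity;          // assumption: buckets start full, so a burst is possible
        this.lastRefillNanos = nowNanos;
    }

    /** Tries to spend one token; returns true if the request is allowed. */
    synchronized boolean tryConsume(long nowNanos) {
        refill(nowNanos);
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;                    // bucket empty: blocked until tokens refill
    }

    /** Whole tokens currently available. */
    synchronized long remaining(long nowNanos) {
        refill(nowNanos);
        return (long) Math.floor(tokens);
    }

    /** Adds tokens in proportion to elapsed time, never exceeding capacity. */
    private void refill(long nowNanos) {
        long elapsed = Math.max(0L, nowNanos - lastRefillNanos);
        tokens = Math.min((double) capacity, tokens + elapsed * tokensPerNano);
        lastRefillNanos = nowNanos;
    }
}
```

Passing the clock in as a parameter keeps the sketch deterministic; a real caller would supply `System.nanoTime()`.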

Architecture

How It's Wired Together

The data model is intentionally split. Redis holds live token counts with TTLs derived from each bucket's refill configuration, so inactive keys expire automatically and stale buckets never accumulate. PostgreSQL holds the durable data: app registrations, rate limit plans, and the full audit log. Spring Boot wires the two together and exposes the HTTP API.

Architecture diagram: Client Request → Spring Boot API → Redis (Bucket4j token bucket) and PostgreSQL (config + audit)

Implementation

Key Design Decisions

Atomic Lua Scripts

Problem: If two nodes both read "one token left" before either writes back, both requests slip through.
Solution: Bucket4j sends the entire check (read, refill, decrement, write) as a single Lua script. Redis executes Lua atomically, so no two requests can interleave on the same key.
Tradeoff: Operations against the same bucket key are effectively serialized to preserve correctness, while different bucket keys are processed independently. Worth knowing under very high contention on a single key.
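
The race and its fix can be illustrated in process. The sketch below is an analogy, not the Lua script Bucket4j sends: it replaces Redis with an `AtomicLong` and gets atomicity from a compare-and-set loop instead of script execution. `AtomicCheckDemo`, `tryConsume`, and `run` are hypothetical names used only here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

final class AtomicCheckDemo {
    /** Check and decrement as one indivisible step (in-process stand-in for the atomic script). */
    static boolean tryConsume(AtomicLong tokens) {
        while (true) {
            long current = tokens.get();
            if (current <= 0) {
                return false;                        // bucket empty: block
            }
            if (tokens.compareAndSet(current, current - 1)) {
                return true;                         // value was unchanged between read and write
            }
            // Another thread changed the count first; re-read and try again.
        }
    }

    /** Fires `requests` concurrent attempts at `capacity` tokens and returns how many passed. */
    static int run(int capacity, int requests) {
        AtomicLong tokens = new AtomicLong(capacity);
        AtomicInteger allowed = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                try {
                    start.await();                   // hold workers until the start signal
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                if (tryConsume(tokens)) {
                    allowed.incrementAndGet();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return allowed.get();
    }
}
```

With a separate read followed by a separate write, two threads could both observe the last token and both pass; folding the check and the decrement into one atomic step is what prevents that.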

        

Lettuce Async Client

Lettuce can multiplex many concurrent requests over a small number of shared connections, so Redis calls don't block threads while waiting on the network. Under load, that adds up: fewer connections handle everything without the pool pressure you'd see with a blocking client.

Config-Driven Plans

Each app's capacity, refill rate, and refill window are stored in PostgreSQL.
Loaded lazily on first use, then held in an in-process cache to avoid repeated DB lookups on hot paths.
The cache can be invalidated via an admin endpoint when configs change, so no service restart is required.
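
A minimal sketch of the lazy-load-then-cache pattern described above. The `RateLimitPlan` record, the `PlanCache` class, and the `loader` function (which stands in for the PostgreSQL lookup) are assumptions for illustration; the actual service may structure its cache differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical plan shape; fields mirror the description above, not the real schema. */
record RateLimitPlan(long capacity, long refillTokens, long refillPeriodSeconds) {}

final class PlanCache {
    private final ConcurrentHashMap<Long, RateLimitPlan> cache = new ConcurrentHashMap<>();
    private final Function<Long, RateLimitPlan> loader;  // stands in for the database query

    PlanCache(Function<Long, RateLimitPlan> loader) {
        this.loader = loader;
    }

    /** Loads the plan on first use, then serves it from memory on later calls. */
    RateLimitPlan get(long appId) {
        return cache.computeIfAbsent(appId, loader);
    }

    /** What an admin endpoint could call after one app's config changes. */
    void invalidate(long appId) {
        cache.remove(appId);
    }

    /** Drops every cached plan, forcing fresh loads on next use. */
    void invalidateAll() {
        cache.clear();
    }
}
```

`computeIfAbsent` runs the loader at most once per key even when several threads ask for the same plan at the same time, which keeps a cold start from issuing duplicate lookups.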

        

Per-IP or Per-App Limiting

Per-app: the whole service shares one bucket, good for capping total throughput. Key: rate_limit:{appId}:{matchedPattern}
Per-IP: each caller gets their own bucket, better for public APIs where one client shouldn't exhaust everyone else's quota. Key: rate_limit:{appId}:{matchedPattern}:{clientIp}
appId is the numeric database ID of the registered app, not the human-readable service name.
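
The two key formats above could be built like this. The `BucketKeys` class and its method names are hypothetical; only the key layouts come from the description.

```java
/** Builds Redis bucket keys in the two formats described above (illustrative helper). */
final class BucketKeys {
    private BucketKeys() {}

    /** One shared bucket per app and matched pattern. */
    static String perApp(long appId, String matchedPattern) {
        return "rate_limit:" + appId + ":" + matchedPattern;
    }

    /** One bucket per app, matched pattern, and client IP. */
    static String perIp(long appId, String matchedPattern, String clientIp) {
        return perApp(appId, matchedPattern) + ":" + clientIp;
    }

    /** Picks the format based on whether per-IP mode is enabled for the plan. */
    static String forRequest(long appId, String matchedPattern, String clientIp, boolean perIpEnabled) {
        return perIpEnabled
                ? perIp(appId, matchedPattern, clientIp)
                : perApp(appId, matchedPattern);
    }
}
```

For example, `BucketKeys.forRequest(42, "/api/**", "203.0.113.42", true)` yields `rate_limit:42:/api/**:203.0.113.42`. IPv6 addresses contain colons, so keys built this way are easy to construct but ambiguous to split back apart on `:`.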

        

Circuit Breaker & Retry

Retry: two attempts, 50 ms apart, only on Redis exceptions.
Circuit breaker: trips when 50% of calls fail over a 20-request window.
Once open, waits 10 seconds before testing again.
Goal: stop hammering a Redis instance that's down and recover automatically when it comes back.
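
To make the breaker's state transitions concrete, here is a simplified count-based circuit breaker in plain Java. It is a hand-rolled illustration and not Resilience4j's implementation: `SimpleCircuitBreaker`, the single trial call in the half-open state, and the explicit `nowMillis` clock are assumptions, and the retry logic is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative count-based circuit breaker; a simplification, not Resilience4j. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;           // e.g. 20 calls
    private final double failureThreshold;  // e.g. 0.5 for 50%
    private final long openWaitMillis;      // e.g. 10_000 ms
    private final Deque<Boolean> window = new ArrayDeque<>();  // true = failed call
    private int failures;
    private State state = State.CLOSED;
    private long openedAtMillis;

    SimpleCircuitBreaker(int windowSize, double failureThreshold, long openWaitMillis) {
        this.windowSize = windowSize;
        this.failureThreshold = failureThreshold;
        this.openWaitMillis = openWaitMillis;
    }

    /** Returns true if a Redis call may be attempted right now. */
    synchronized boolean allowCall(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAtMillis >= openWaitMillis) {
            state = State.HALF_OPEN;        // wait elapsed: permit a trial call
        }
        return state != State.OPEN;
    }

    /** Records the outcome of a call that was allowed. */
    synchronized void record(boolean failed, long nowMillis) {
        if (state == State.OPEN) {
            return;                         // no calls are made while open
        }
        if (state == State.HALF_OPEN) {
            if (failed) { open(nowMillis); } else { reset(State.CLOSED); }
            return;
        }
        window.addLast(failed);
        if (failed) failures++;
        if (window.size() > windowSize) {
            boolean evictedFailure = window.removeFirst();
            if (evictedFailure) failures--;
        }
        // Only judge the failure rate once the window is full.
        if (window.size() == windowSize && (double) failures / windowSize >= failureThreshold) {
            open(nowMillis);
        }
    }

    synchronized State state() { return state; }

    private void open(long nowMillis) {
        reset(State.OPEN);
        openedAtMillis = nowMillis;
    }

    private void reset(State next) {
        state = next;
        window.clear();
        failures = 0;
    }
}
```

With the values above (`new SimpleCircuitBreaker(20, 0.5, 10_000)`), ten failures within a full 20-call window open the breaker, calls are refused for 10 seconds, and a successful trial call closes it again.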

        

Fail-Open / Fail-Closed

Fail-open: lets the request through. Prioritizes availability over enforcement.
Fail-closed: returns a 503. Prioritizes correctness over availability.
Configured per-app in the database rather than as a global setting, so different services can choose different failure behavior depending on how critical enforcement is for them.
Both outcomes are logged so you can see how often the fallback actually fires in production.
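
The per-app fallback choice can be sketched as a small decision function. The names `FailureStrategy`, `FallbackDecision`, and `RedisFallback` are hypothetical, and returning status 200 in the fail-open case is an assumption based on the allowed response shown in the API section.

```java
import java.util.logging.Logger;

/** Per-app failure behavior, as it might be stored with the app's configuration. */
enum FailureStrategy { FAIL_OPEN, FAIL_CLOSED }

/** Outcome used when Redis cannot be reached (illustrative shape). */
record FallbackDecision(boolean allowed, int httpStatus) {}

final class RedisFallback {
    private static final Logger LOG = Logger.getLogger(RedisFallback.class.getName());

    private RedisFallback() {}

    /** Decides what to return for one request when the bucket check could not run. */
    static FallbackDecision decide(long appId, FailureStrategy strategy) {
        FallbackDecision decision = switch (strategy) {
            case FAIL_OPEN -> new FallbackDecision(true, 200);    // availability over enforcement
            case FAIL_CLOSED -> new FallbackDecision(false, 503); // correctness over availability
        };
        // Log both outcomes so the fallback rate is visible afterwards.
        LOG.warning(() -> "Redis unavailable for app " + appId + ": "
                + strategy + " -> allowed=" + decision.allowed());
        return decision;
    }
}
```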

        

MCP Server

AI-Accessible Observability

The service exposes its internal data through a Model Context Protocol (MCP) server so any third-party application or agent system can query rate limiter state without holding direct database credentials or knowing the internal schema. Any MCP-compatible client (an AI assistant, an internal monitoring tool, or a custom agent pipeline) can connect and call these tools over the network.

Spring AI's MCP support handles all the transport wiring. Each tool is a plain Spring bean method annotated with @Tool; Spring AI registers them over HTTP/SSE transport automatically. There are 7 tools across 3 groups, all read-only.

Audit Log Tools

Exposes rate_limit_log table data with time-window filtering. This is the highest-priority group: agents need recent logs to compute their metrics and cannot do so efficiently through the existing paginated REST endpoint.

get_recent_logs: raw log entries for an app within N minutes, up to 5,000 records; includes IP, path, HTTP method, remaining tokens, response code, bot signals, and Redis failure flag
get_error_summary: block rate, response-code breakdown, top error paths, top block reasons, unique IPs, IP concentration percentage, and Redis failure count
get_token_health_summary: average, min, and max remaining tokens; depletion and near-depletion (≤10% capacity) counts; per-path token consumption; IPs near or at depletion
get_top_paths_summary: top 10 paths by traffic and top 5 by block rate over a 60-minute window, with per-path method and IP breakdown

App Registry Tools

Exposes app_info and rate_limit_plan records so agents know which apps exist, what their limits are, and whether per-IP mode is active.

list_apps: all enabled apps; the agent scheduler calls this at the start of every run to decide what to analyze
get_app: single app with all active plans; gives the orchestrator agent the capacity, refill rate, refill window, and per-IP flag it needs to frame the analysis

Service Health Tools

Exposes the operational state of the service itself. Agents check this before interpreting results: a low block rate during a Redis outage in fail-open mode should not be treated as healthy traffic.

get_service_health: database and Redis connection status (ok / error: {message}), configured failure strategy, and timestamp

Transport & Auth

The MCP server runs over HTTP/SSE transport, the right choice for a standalone Spring Boot service that agents reach over the network. No subprocess or stdio needed; the agent project connects by URL.

All MCP routes are gated by a shared-secret filter. Requests to /mcp/** must include an X-MCP-Secret header matching the value in application.yml. Requests with a missing or invalid secret are rejected with 401 by a servlet filter before reaching the MCP handlers.
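
The core of such a filter is the secret comparison. The sketch below shows only that check in plain Java, without the servlet wiring. `McpSecretCheck` is a hypothetical name, and the constant-time comparison via `MessageDigest.isEqual` is an assumption about how one might compare secrets, not a statement about how the service does it.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

/** Illustrative shared-secret check for the X-MCP-Secret header. */
final class McpSecretCheck {
    private McpSecretCheck() {}

    /**
     * Returns true only when the header is present and matches the configured secret.
     * A filter would respond with 401 whenever this returns false.
     */
    static boolean isAuthorized(String headerValue, String expectedSecret) {
        if (headerValue == null || expectedSecret == null || expectedSecret.isEmpty()) {
            return false;  // missing header or unset secret: reject
        }
        // MessageDigest.isEqual compares in constant time, avoiding early-exit timing leaks.
        return MessageDigest.isEqual(
                headerValue.getBytes(StandardCharsets.UTF_8),
                expectedSecret.getBytes(StandardCharsets.UTF_8));
    }
}
```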

API

Request & Response

Three fields are required; everything else is optional context used for logging and future policy extensions. The response body maps directly to standard RateLimit-* headers.

▸Request

POST /api/v1/ratelimit/check

{
  "serviceIdentifier": "payments-api",
  "clientIp": "203.0.113.42",
  "requestPath": "/api/checkout",
  "httpMethod": "POST",
  "traceId": "abc-123-def-456",
  "deviceType": "desktop",
  "isBot": false,
  "botName": null,
  "browser": "Chrome",
  "os": "Windows",
  "requestSize": 512,
  "referer": "https://google.com"
}

Required: serviceIdentifier, clientIp, requestPath
Optional: httpMethod, traceId, deviceType, isBot, botName, browser, os, requestSize, referer

▸Response · 200 OK

HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 42
RateLimit-Reset: 12
Content-Type: application/json

{
  "serviceName": "payments-api",
  "allowed": true,
  "remainingTokens": 42,
  "limit": 100,
  "resetAfterSeconds": 12,
  "matchedPattern": "/api/**",
  "timestamp": "2025-05-10T14:23:01.456Z"
}

▸Response · 429 Too Many Requests

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 12
Retry-After: 12
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "code": 429,
  "message": "Too many requests",
  "retryAfterSeconds": 12,
  "timestamp": "2025-05-10T14:23:01.456Z"
}
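
A caller that forwards these values to its own clients might map them to headers as follows. This is a sketch under assumptions: `CheckResult` is a hypothetical holder for the values the caller has extracted from the response, and `RateLimitHeaders` is an illustrative helper, not part of the service.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Values a caller has pulled out of the rate limiter's response (hypothetical shape). */
record CheckResult(boolean allowed, long limit, long remainingTokens, long resetAfterSeconds) {}

final class RateLimitHeaders {
    private RateLimitHeaders() {}

    /** Builds the headers shown in the examples above, in the same order. */
    static Map<String, String> from(CheckResult result) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("RateLimit-Limit", Long.toString(result.limit()));
        headers.put("RateLimit-Remaining", Long.toString(result.remainingTokens()));
        headers.put("RateLimit-Reset", Long.toString(result.resetAfterSeconds()));
        if (!result.allowed()) {
            // Blocked responses also tell the client how long to wait.
            headers.put("Retry-After", Long.toString(result.resetAfterSeconds()));
        }
        return headers;
    }
}
```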

Load Test

Performance Under Load

Six k6 scenarios were executed sequentially against a single server (3 vCPU, 3.7 GB RAM, Ubuntu 22.04), generating 635,870 total requests over ~585 seconds. k6 ran on the same host as the service, so results exclude network latency but include CPU contention between load generator and application.

Throughput: ramp from 50 to 100 VUs to determine sustainable request rate under increasing load
Latency: 20 VUs steady-state test for p50/p95/p99 measurement under stable conditions
Cache effectiveness: 30 VUs comparing cold DB lookups vs warm cache hits
Concurrency correctness: 50 VUs issuing 100 concurrent requests against a 50-token bucket to validate atomic enforcement
Fail-open baseline: 10 VUs with Redis healthy (p50: 5.0 ms)
Fail-open degraded: Redis disabled, same load to validate fallback behavior (p50: 4.0 ms due to Redis bypass)
Audit logging: all requests persisted synchronously via @Transactional PostgreSQL writes (no async buffering)

      

Methodology

k6 executed on the same host as the service, excluding network latency but introducing CPU contention
All requests include synchronous PostgreSQL writes via @Transactional
Metrics collected after ~150,000 warm-up requests to eliminate JVM cold-start effects
Load generated with no pacing; throughput constrained only by system capacity
HTTP 429 responses are treated as valid rate-limiting outcomes, not system errors

1,086.9 avg req/s

12.3 ms p50 latency

97.4 ms p95 latency

162.6 ms p99 latency

1.5x cache speedup

0% application errors

100% availability (Redis outage)

Cache speedup compares cold database lookups (p50: 22.5 ms) against warm cache hits (p50: 15.0 ms). The fail-open degraded latency (4.0 ms) is lower than baseline (5.0 ms) because Redis calls are bypassed when the circuit breaker is open. This reflects reduced per-request work, not improved healthy-state performance. The concurrency test confirms correct enforcement under full contention for the tested 50-token bucket using atomic Lua scripting.

Live Demo

Try It on This Page

This page is rate limited. Refresh more than twice within 5 seconds and the gateway returns a 429 Too Many Requests instead of this page. Wait 5 seconds and it loads normally again.

Active Plan — `/projects/rate-limiter`

capacity	refill rate	refill period
2 tokens	2 tokens	5 s

Each visit spends 1 token. After 2 requests the bucket is empty; the 3rd returns 429. Tokens refill continuously, so after 5 s the bucket is full again.
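
The refill arithmetic for this plan, assuming tokens are restored at a constant rate as described above:

```latex
\[
r = \frac{2\ \text{tokens}}{5\ \text{s}} = 0.4\ \text{tokens/s},
\qquad
t_{\text{one token}} = \frac{1}{r} = 2.5\ \text{s},
\qquad
t_{\text{full}} = \frac{2\ \text{tokens}}{r} = 5\ \text{s}.
\]
```

Under that assumption, an empty bucket would have one token again after roughly 2.5 s and both tokens after 5 s with no further requests.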

View the Source

Full implementation on GitHub.
