[Feat] Implement per-user LLM rate limiting and update documentation#5604

Open
Jayanaka-98 wants to merge 10 commits into jaseci-labs:main from Jayanaka-98:rate_limmiting
Conversation

@Jayanaka-98
Collaborator

Per-User LLM Rate Limiting and Budget Enforcement

Adds configurable per-user rate limits and spend caps for all `by llm()` calls in jac-scale. When a user exceeds any configured limit, the call is blocked before it reaches the LLM provider and an exception is raised.


Motivation

Without per-user limits, a single user can exhaust the entire LLM budget of a deployment, either accidentally (runaway loops) or deliberately. This PR makes it possible to set hard guardrails per user at the platform level, without requiring any changes to application code.


What changed

context.jac: Added username field to JScaleExecutionContext. This is deliberately scoped to jac-scale (not the jaclang base ExecutionContext) since username is a deployment concern, not a language runtime concern.

jfast_api.impl.jac: In request_context_middleware, the authenticated username from request.state is written into the execution context after JWT validation. Uses hasattr duck-typing to avoid a cross-module import.
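The middleware step above can be sketched in plain Python. This is a stand-in, not the real jac-scale code: `ExecutionContext`, `inject_username`, and `RequestState` are hypothetical names illustrating the `hasattr` duck-typing described, and the real logic lives in `jfast_api.impl.jac` against the Jac runtime's context.

```python
class ExecutionContext:
    """Stand-in for JScaleExecutionContext with its new `username` field."""
    def __init__(self):
        self.username = None

def inject_username(ctx, request_state):
    # hasattr duck-typing: only write the field if this context type
    # actually declares it, so no cross-module import of the jac-scale
    # context class is needed here.
    if hasattr(ctx, "username"):
        ctx.username = getattr(request_state, "user", None)

class RequestState:
    """Stand-in for request.state after JWT validation."""
    user = "alice"

ctx = ExecutionContext()
inject_username(ctx, RequestState())
print(ctx.username)  # -> alice
```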

llm_telemetry.jac / llm_telemetry.impl.jac: New JacUserRateLimiter class (pure logic, no litellm dependency) with:

  • check_pre_call(username): checks all configured limits, raises on violation
  • record_success(username, tokens, cost_usd, model): increments Redis counters and writes a MongoDB usage record

A thin _LiteLLMRateLimitAdapter(CustomLogger) wires the limiter into LiteLLM's native callback system (log_pre_api_call for blocking, log_success_event for tracking). The adapter reads the username from the Jac execution context at call time.
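A minimal sketch of the limiter's two-method contract, with an in-memory dict standing in for the Redis counters and only the `rpm` dimension modeled (the class and exception names here are illustrative, not the actual implementation):

```python
import time

class RateLimitExceeded(Exception):
    pass

class UserRateLimiter:
    """Pure-logic sketch: check before the call, record after it."""
    def __init__(self, rpm=None):
        self.rpm = rpm
        self._counts = {}  # (username, minute) -> request count

    def check_pre_call(self, username):
        # unlimited dimension or unauthenticated request: skip the check
        if self.rpm is None or username is None:
            return
        minute = int(time.time() // 60)
        if self._counts.get((username, minute), 0) >= self.rpm:
            raise RateLimitExceeded(f"rpm limit {self.rpm} hit for {username}")

    def record_success(self, username, tokens, cost_usd, model):
        # called from the post-call success event with real usage numbers
        minute = int(time.time() // 60)
        key = (username, minute)
        self._counts[key] = self._counts.get(key, 0) + 1

limiter = UserRateLimiter(rpm=2)
for _ in range(2):
    limiter.check_pre_call("alice")
    limiter.record_success("alice", tokens=10, cost_usd=0.001, model="gpt-x")
try:
    limiter.check_pre_call("alice")  # third request in the same minute
except RateLimitExceeded as e:
    print("blocked:", e)
```

In the real adapter, `check_pre_call` would be invoked from `log_pre_api_call` and `record_success` from `log_success_event`.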

Budget persistence: RPM/RPD/TPM/TPD use Redis counters with short TTLs (losing these on restart is acceptable). Daily and monthly budgets use MongoDB as the source of truth with a 60-second Redis cache. On cache miss (e.g. after a Redis restart), the total is rebuilt from a MongoDB aggregation, so budget limits survive restarts.
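The cache-miss rebuild path can be sketched as follows, with a list of dicts standing in for the MongoDB usage records and a plain dict for the Redis cache (all names here are illustrative):

```python
import time

class BudgetTracker:
    CACHE_TTL = 60  # seconds, mirroring the 60-second Redis cache

    def __init__(self):
        self.records = []   # stand-in for MongoDB usage records
        self._cache = {}    # day -> (total, cached_at); stand-in for Redis

    def record(self, day, cost_usd):
        self.records.append({"day": day, "cost_usd": cost_usd})
        if day in self._cache:  # keep a warm cache entry in sync
            total, at = self._cache[day]
            self._cache[day] = (total + cost_usd, at)

    def daily_total(self, day, now=None):
        now = now if now is not None else time.time()
        hit = self._cache.get(day)
        if hit and now - hit[1] < self.CACHE_TTL:
            return hit[0]
        # cache miss (e.g. Redis restart): rebuild from the records,
        # analogous to a MongoDB aggregation over that day's documents
        total = sum(r["cost_usd"] for r in self.records if r["day"] == day)
        self._cache[day] = (total, now)
        return total

t = BudgetTracker()
t.record("2024-06-01", 1.25)
t.record("2024-06-01", 0.75)
t._cache.clear()                     # simulate a Redis flush
print(t.daily_total("2024-06-01"))   # -> 2.0, rebuilt from records
```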

config_loader.jac / config_loader.impl.jac: New [plugins.scale.llm_limits] section with enabled, rpm, rpd, tpm, tpd, daily_budget_usd, monthly_budget_usd. All fields default to null (disabled).

test_llm_rate_limiting.jac: 14 tests across three tiers:

  • Unit tests (mock Redis/MongoDB): each limit type blocks correctly, no-Redis is a no-op, budget cache hit/miss behavior, MongoDB write shape
  • Integration tests (real Redis via testcontainers): RPM enforced + per-user isolation, TPM accumulation, RPD day-rollover reset
  • Persistence tests (real Redis + MongoDB): daily budget rebuilds from MongoDB after Redis flush, monthly budget spans multiple days correctly
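The day-rollover behavior exercised by the integration tests falls out of date-scoped counter keys: a new day means a fresh key, so no explicit reset is needed and the old key simply expires. The key format below is hypothetical, for illustration only:

```python
from datetime import date

def rpd_key(username, day):
    # hypothetical key scheme; the real key layout is an implementation detail
    return f"llm:rpd:{username}:{day.isoformat()}"

counters = {}  # stand-in for Redis INCR counters

def incr(key):
    counters[key] = counters.get(key, 0) + 1
    return counters[key]

k1 = rpd_key("alice", date(2024, 6, 1))
incr(k1)
incr(k1)
k2 = rpd_key("alice", date(2024, 6, 2))   # next day: fresh key, count starts at 0
print(counters.get(k1, 0), counters.get(k2, 0))  # -> 2 0
```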

docs/reference/plugins/jac-scale.md: New "Per-User LLM Rate Limiting" section covering configuration, how it works, unauthenticated request behavior, MongoDB usage record schema, and a minimal daily-budget-only example.


Configuration

```toml
[plugins.scale.llm_limits]
enabled = true
rpm  = 60
rpd  = 1000
tpm  = 100000
tpd  = 500000
daily_budget_usd   = 5.00
monthly_budget_usd = 50.00
```

All fields are optional; omit any field to leave that dimension unlimited. If `enabled = false` or the section is absent, no limiting occurs at all.


Design notes

  • No application code changes required. Limits are enforced at the infrastructure layer via LiteLLM's CustomLogger callback, the same mechanism used by the existing JacLLMLogger telemetry.
  • Unauthenticated requests are not blocked. If no username is present in the execution context, the limiter is skipped. Gate with :priv walkers if you need enforcement on all traffic.
  • Token limits are best-effort pre-call. log_pre_api_call fires before the LLM call; token counts come from the post-call event. The check reads the accumulated counter and blocks if already at or above the limit. The first call that crosses the threshold goes through.
  • Streaming coverage. log_pre_api_call fires for both streaming and non-streaming calls routed through litellm.completion. Calls made via the OpenAI SDK directly (bypassing litellm) are not intercepted.
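The best-effort token-limit semantics from the notes above can be made concrete with a small sketch (illustrative names, `tpm` dimension only): the pre-call check only sees tokens already recorded by earlier post-call events, so a single call may overshoot the limit and only the next call is blocked.

```python
class TokenLimitExceeded(Exception):
    pass

class TpmCheck:
    def __init__(self, tpm):
        self.tpm = tpm
        self.used = 0  # tokens recorded by post-call events this minute

    def pre_call(self):
        # blocks only when the accumulated counter is already at the limit
        if self.used >= self.tpm:
            raise TokenLimitExceeded(f"{self.used} >= {self.tpm}")

    def post_call(self, tokens):
        self.used += tokens  # token counts are only known after the call

c = TpmCheck(tpm=100)
c.pre_call()        # used=0 < 100: allowed
c.post_call(150)    # this single call overshoots the limit...
try:
    c.pre_call()    # ...so only the *next* call is blocked
except TokenLimitExceeded as e:
    print("blocked:", e)
```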
