[Feat] Implement per-user LLM rate limiting and update documentation #5604
Jayanaka-98 wants to merge 10 commits into jaseci-labs:main
Per-User LLM Rate Limiting and Budget Enforcement

Adds configurable per-user rate limits and spend caps on all `by llm()` calls in jac-scale. When a user exceeds any configured limit, the call is blocked before it reaches the LLM provider and an exception is raised.

Motivation
Without per-user limits, a single user can exhaust the entire LLM budget of a deployment, either accidentally (runaway loops) or deliberately. This PR makes it possible to set hard guardrails per user at the platform level, without requiring any changes to application code.
What changed
- `context.jac`: Added a `username` field to `JScaleExecutionContext`. This is deliberately scoped to jac-scale (not the jaclang base `ExecutionContext`), since username is a deployment concern, not a language runtime concern.
- `jfast_api.impl.jac`: In `request_context_middleware`, the authenticated username from `request.state` is written into the execution context after JWT validation. Uses `hasattr` duck-typing to avoid a cross-module import.
- `llm_telemetry.jac` / `llm_telemetry.impl.jac`: New `JacUserRateLimiter` class (pure logic, no litellm dependency) with:
  - `check_pre_call(username)`: checks all configured limits, raises on violation
  - `record_success(username, tokens, cost_usd, model)`: increments Redis counters and writes a MongoDB usage record

  A thin `_LiteLLMRateLimitAdapter(CustomLogger)` wires the limiter into LiteLLM's native callback system (`log_pre_api_call` for blocking, `log_success_event` for tracking). The adapter reads the username from the Jac execution context at call time.
- Budget persistence: RPM/RPD/TPM/TPD use Redis counters with short TTLs (losing them on restart is fine). Daily and monthly budgets use MongoDB as the source of truth with a 60-second Redis cache. On cache miss (e.g. after a Redis restart), the total is rebuilt from a MongoDB aggregation, so budget limits survive restarts.
- `config_loader.jac` / `config_loader.impl.jac`: New `[plugins.scale.llm_limits]` section with `enabled`, `rpm`, `rpd`, `tpm`, `tpd`, `daily_budget_usd`, `monthly_budget_usd`. All fields default to `null` (disabled).
- `test_llm_rate_limiting.jac`: 14 tests across three tiers.
- `docs/reference/plugins/jac-scale.md`: New "Per-User LLM Rate Limiting" section covering configuration, how it works, unauthenticated request behavior, MongoDB usage record schema, and a minimal daily-budget-only example.

Configuration
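A sketch of what the section looks like in TOML. Only the section name and field names come from this PR; the values shown are made up for illustration:

```toml
[plugins.scale.llm_limits]
enabled = true
rpm = 30                  # requests per minute (illustrative value)
rpd = 2000                # requests per day
tpm = 50000               # tokens per minute
tpd = 1000000             # tokens per day
daily_budget_usd = 5.0
monthly_budget_usd = 50.0
```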
All fields are optional. Omit any field to leave that dimension unlimited. If `enabled = false` or the section is absent, no limiting occurs at all.

Design notes
- Enforcement hooks into LiteLLM's `CustomLogger` callback, the same mechanism used by the existing `JacLLMLogger` telemetry.
- Unauthenticated requests carry no username and are not limited; restrict entry points to `:priv` walkers if you need enforcement on all traffic.
- `log_pre_api_call` fires before the LLM call, but token counts only come from the post-call event. The pre-call check therefore reads the accumulated counter and blocks if it is already at or above the limit, so the first call that crosses the threshold goes through.
- `log_pre_api_call` fires for both streaming and non-streaming calls routed through `litellm.completion`. Calls made via the OpenAI SDK directly (bypassing litellm) are not intercepted.
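The check-then-record semantics above can be sketched in plain Python, with in-memory dicts standing in for the Redis counters. The class and method names mirror the PR's `JacUserRateLimiter`, but this is an illustrative sketch under those assumptions, not the actual implementation:

```python
import time


class RateLimitExceeded(Exception):
    """Raised when a user is already at or above a configured limit."""


class UserRateLimiter:
    """In-memory stand-in for a per-user limiter.

    Counters are keyed by (user, metric, time-window); a real deployment
    would use Redis counters with TTLs instead of a dict.
    """

    def __init__(self, rpm=None, tpm=None, daily_budget_usd=None):
        self.rpm = rpm
        self.tpm = tpm
        self.daily_budget_usd = daily_budget_usd
        self._counters = {}  # (user, metric, window) -> accumulated value

    def _window(self, seconds):
        return int(time.time() // seconds)

    def _get(self, user, metric, seconds):
        return self._counters.get((user, metric, self._window(seconds)), 0)

    def _add(self, user, metric, seconds, amount):
        key = (user, metric, self._window(seconds))
        self._counters[key] = self._counters.get(key, 0) + amount

    def check_pre_call(self, user):
        # Pre-call check reads the *accumulated* counters: the first call
        # that crosses a threshold still goes through; later calls block.
        if self.rpm is not None and self._get(user, "requests", 60) >= self.rpm:
            raise RateLimitExceeded(f"{user}: RPM limit {self.rpm} reached")
        if self.tpm is not None and self._get(user, "tokens", 60) >= self.tpm:
            raise RateLimitExceeded(f"{user}: TPM limit {self.tpm} reached")
        if (self.daily_budget_usd is not None
                and self._get(user, "spend", 86400) >= self.daily_budget_usd):
            raise RateLimitExceeded(f"{user}: daily budget reached")
        self._add(user, "requests", 60, 1)

    def record_success(self, user, tokens, cost_usd):
        # Post-call accounting: token counts and cost are only known
        # after the provider responds.
        self._add(user, "tokens", 60, tokens)
        self._add(user, "spend", 86400, cost_usd)
```

With `rpm=2`, the first two calls for a user pass and the third raises, matching the "first call that crosses the threshold goes through" behavior described above.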