Building LiteLLM Plugins with `async_pre_call_hook` for Proxy Mode

You're using the LiteLLM Proxy, appreciating its ability to unify access to various Large Language Models (LLMs). But what if you need more? What if you need to dynamically route requests based on user budgets, enforce custom security policies before hitting an expensive model, modify prompts on the fly, or reject requests based on complex business logic? Standard callbacks log events after they happen, but true intervention requires getting involved before the core action.

This is where LiteLLM plugins, powered by the async_pre_call_hook, come into play. This hook is arguably the most powerful intervention point within the proxy's request lifecycle, designed specifically for developers who need to intercept, analyze, modify, or even block requests just moments before they are sent to the target LLM provider.

This comprehensive guide will walk you through everything you need to know to master the async_pre_call_hook and build sophisticated, custom plugins for your LiteLLM Proxy deployment.

Who is this guide for?

  • Python developers building custom solutions on top of the LiteLLM Proxy.
  • Platform engineers needing to enforce specific policies, security measures, or cost controls.
  • Anyone wanting to add custom routing, validation, or request enrichment logic directly within the proxy layer.
  • Developers aiming to create shareable LiteLLM plugin modules.

What you will learn:

  • The precise role and timing of async_pre_call_hook within the proxy request flow.
  • A deep dive into the hook's signature: its input parameters (user_api_key_dict, cache, data, call_type) and their significance.
  • How to leverage the rich context provided by UserAPIKeyAuth for user-aware plugin logic.
  • Effective techniques for inspecting and safely modifying the crucial data dictionary.
  • Utilizing the DualCache for building stateful plugins (like rate limiters or usage trackers).
  • Mastering the hook's return values (None, dict, str) and exception handling (HTTPException) to control the request's fate.
  • Step-by-step instructions for creating, configuring, and testing your first plugin.
  • Common and advanced plugin patterns: dynamic routing, input validation, PII masking, policy enforcement, custom rejection, A/B testing setup, and more.
  • Best practices for writing performant, secure, and maintainable plugins.
  • Debugging strategies for troubleshooting your hook implementations.

Part 1: Why async_pre_call_hook is Your Go-To for Plugins

LiteLLM offers various callback points (log_success_event, log_failure_event, etc.), primarily designed for observability – logging what happened after the fact. While essential, they don't allow you to change the course of action.

The async_pre_call_hook stands apart due to its unique characteristics tailored for plugin development:

  1. Timing is Everything: It executes at the most critical juncture: after initial authentication and request preparation, but before the potentially costly and time-consuming call to the actual LLM API (like OpenAI, Anthropic, Cohere, etc.). This is your golden window for intervention.
  2. Intervention, Not Just Observation: Unlike logging callbacks, the async_pre_call_hook is designed for action. Its return value or the exceptions it raises directly dictate whether the request proceeds, gets modified, or is rejected outright.
  3. Rich Context: It receives not just the request payload but also vital contextual information:
    • Detailed user/key authentication data (UserAPIKeyAuth).
    • Access to the proxy's shared cache (DualCache).
    • The specific type of call being made (call_type).
  4. Mutability: It receives the data dictionary (the prepared request payload for the LLM) and can directly modify it before allowing the request to proceed.
  5. Proxy-Specific Power: This hook is a feature of the LiteLLM Proxy environment, leveraging the proxy's infrastructure (authentication, caching, configuration) to enable complex logic that wouldn't be feasible in a simple library integration.

In short, if your plugin needs to:

  • Change the target model dynamically.
  • Add, remove, or modify request parameters (temperature, max_tokens, messages, etc.).
  • Validate input against custom rules before spending money on an LLM call.
  • Enforce fine-grained access control based on user roles, budgets, or permissions stored in key metadata.
  • Reject requests based on content analysis or external policy checks.
  • Implement sophisticated rate limiting or concurrency controls.
  • Mask sensitive data before it leaves your infrastructure.

...then async_pre_call_hook is the hook you need.

Part 2: Anatomy of the async_pre_call_hook Signature

Understanding the inputs and outputs of the hook is fundamental to building plugins. Let's dissect the signature provided by the CustomLogger base class:

# Defined in litellm.integrations.custom_logger.CustomLogger
async def async_pre_call_hook(
    self,
    user_api_key_dict: UserAPIKeyAuth,  # <<< Your window into user/key context
    cache: DualCache,                   # <<< Your tool for managing state
    data: dict,                         # <<< The request payload (mutable!)
    call_type: Literal[                 # <<< The type of LLM operation
        "completion", "text_completion", "embeddings",
        "image_generation", "moderation", "audio_transcription",
        "pass_through_endpoint", "rerank", ... # (Potentially more types)
    ]
) -> Optional[Union[Exception, str, dict]]: # <<< Determines request outcome
    # Your plugin logic goes here
    pass

Let's break down each component:

  • self: The instance of your custom handler class (the one inheriting from CustomLogger). Allows you to access other methods or attributes you might define in your class (e.g., configuration loaded during __init__).

  • user_api_key_dict: UserAPIKeyAuth: This is arguably the most valuable parameter for contextual plugins. It's not a plain dictionary but a Pydantic model (defined in proxy/_types.py) populated during the authentication phase (user_api_key_auth.py), which is why the examples below read it with attribute access (e.g., user_api_key_dict.metadata). It provides a wealth of information about the validated API key and its associated entities. You can reliably expect fields like:

    • token: The hashed version of the API key used.
    • key_name, key_alias: Human-readable identifiers for the key.
    • user_id: The identifier for the internal user associated with the key (if any).
    • team_id: The identifier for the team associated with the key (if any).
    • org_id: The identifier for the organization (if using org features).
    • spend, max_budget, soft_budget, budget_duration, expires: Budgeting and lifecycle info for the key itself.
    • tpm_limit, rpm_limit, max_parallel_requests: Rate limits configured for this specific key.
    • models: A list of model names/patterns this key is allowed to access.
    • metadata: A critical dictionary where you can store custom key-value pairs when creating/updating the key (via API or UI). Use this extensively for plugin configuration, like storing user roles, permissions flags, custom routing rules, specific budget overrides, etc.
    • team_spend, team_max_budget, team_tpm_limit, team_rpm_limit, team_models, team_metadata: Similar information, but inherited from the key's associated team. Your plugin logic might prioritize key-level settings over team-level ones or combine them.
    • user_role: The determined role of the user (e.g., LitellmUserRoles.PROXY_ADMIN, LitellmUserRoles.INTERNAL_USER).
    • end_user_id, end_user_tpm_limit, end_user_rpm_limit, end_user_max_budget: Information related to the end-user ID passed in the request (data['user']), if available and tracked.
    • parent_otel_span: For OpenTelemetry integration.
    • ...and potentially more fields, depending on LiteLLM version and configuration.
    • Plugin Usage: Essential for implementing any logic that depends on the caller's identity, permissions, budget, or pre-configured settings.
  • cache: DualCache: An instance of the DualCache class (defined in caching/dual_cache.py). This object provides an interface to LiteLLM's caching system, which typically combines an in-memory cache (like InMemoryCache) for speed and potentially a distributed cache (like RedisCache) for persistence and sharing state across multiple proxy instances.

    • Key Methods: async_get_cache(key), async_set_cache(key, value, ttl=...), async_batch_get_cache(keys), async_batch_set_cache(cache_list, ttl=...), async_increment_cache(key, value, ttl=...), async_delete_cache(key).
    • Plugin Usage: Absolutely essential for stateful plugins. The parallel_request_limiter.py example relies heavily on the cache to store and update request counts per key/user/team per minute (a usage sketch appears at the end of this parameter breakdown). Use it for:
      • Rate limiting / Throttling counters.
      • Tracking recent user activity.
      • Storing temporary flags or states related to a user or request flow.
      • Caching results from external systems queried by the plugin (use with appropriate TTLs).
  • data: dict: This is the heart of the request payload that LiteLLM is preparing to send to the underlying LLM provider. Crucially, it is mutable. Modifications you make to this dictionary within the hook will be reflected in the actual call made, if your hook allows the request to proceed.

    • Content: Contains standard LLM call parameters like model, messages, input (for embeddings), temperature, max_tokens, stream, tools, function_call, user, etc.
    • Enrichment: As confirmed by litellm_pre_call_utils.py, this dictionary arrives at the hook already enriched with a metadata (or litellm_metadata) sub-dictionary containing contextual info derived from user_api_key_dict and the request headers/query params. This means you often don't need to re-extract basic info from user_api_key_dict if it's already conveniently placed in data['metadata'].
    • Plugin Usage:
      • Inspection: Read values like data['model'], data['messages'], data['user'] to make decisions.
      • Modification: Directly change values: data['model'] = 'new-model', data['max_tokens'] = 500, data['messages'].append(...).
      • Addition: Add new parameters if supported by the target model/LiteLLM: data['custom_param'] = 'value'.
      • Deletion: Remove parameters: del data['frequency_penalty'].
  • call_type: Literal[...]: A string indicating the type of LLM operation being attempted. This allows your plugin to apply logic selectively.

    • Examples: "completion", "embeddings", "image_generation", "moderation", "audio_transcription", "rerank", "pass_through_endpoint".
    • Plugin Usage: Use conditional logic (if call_type == "completion": ...) to ensure your plugin only runs or behaves correctly for the intended operations. For example, a prompt modification plugin should likely only run for "completion". An embedding cost check plugin should only run for "embeddings".
  • Return Value (-> Optional[Union[Exception, str, dict]]): This determines the fate of the request after your hook finishes.

    • Return None (Implicitly or Explicitly): Proceed. Signals that the hook's checks passed and the request should continue to the LLM provider. The data dictionary (potentially modified within the hook) will be used for the call. This is the standard way to allow a request after inspection/modification.
    • Return data (dict): Proceed with Modifications. Explicitly returns the (potentially modified) data dictionary. Functionally similar to returning None after modifying data in place, but can be clearer. The returned dictionary replaces the original data for the LLM call.
    • Return str: Reject with Custom Response (Chat/Completion Only). For call_type "completion" or "text_completion", returning a string signals rejection. LiteLLM intercepts this, does not call the LLM, and formats the string as a standard successful response from the assistant (as seen in the "Hello world" rejection example). For other call_types, this might result in a 400 error or unexpected behavior – primarily use this for chat endpoints.
    • Raise Exception: Reject with Error. This is the standard way to forcefully reject a request due to policy violations, invalid input, budget limits, or failed external checks.
      • Raising fastapi.HTTPException(status_code=..., detail=...) is highly recommended. It allows you to control the exact HTTP status code (e.g., 400 Bad Request, 401 Unauthorized, 403 Forbidden, 429 Too Many Requests, 500 Internal Server Error) and the error message returned to the original client.
      • The parallel_request_limiter.py uses HTTPException(status_code=429, ...) for rate limit exceeded errors.
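
To tie these pieces together, here is a minimal sketch of a hook that reads context from user_api_key_dict, keeps a per-key counter in the DualCache, and uses the return-value and exception rules above to decide the request's fate. The cache key format, the default limit of 60, and the requests_per_minute metadata field are illustrative assumptions, not LiteLLM conventions.

# A minimal sketch combining the parameters above. The cache key format and
# the "requests_per_minute" metadata field are illustrative assumptions.
from datetime import datetime, timezone
from typing import Literal, Optional, Union

from fastapi import HTTPException
from litellm.integrations.custom_logger import CustomLogger
from litellm.proxy._types import UserAPIKeyAuth
from litellm.caching.dual_cache import DualCache


class SimpleCounterPlugin(CustomLogger):
    async def async_pre_call_hook(
        self,
        user_api_key_dict: UserAPIKeyAuth,
        cache: DualCache,
        data: dict,
        call_type: Literal["completion", "text_completion", "embeddings"],
    ) -> Optional[Union[Exception, str, dict]]:
        # 1. Context from the authenticated key
        hashed_key = user_api_key_dict.token
        limit = (user_api_key_dict.metadata or {}).get("requests_per_minute", 60)

        # 2. Stateful check via the shared DualCache
        minute = datetime.now(timezone.utc).strftime("%H-%M")
        counter_key = f"plugin_counter:{hashed_key}:{minute}"
        current = await cache.async_get_cache(counter_key) or 0
        if current >= limit:
            # Reject: the client gets a 429 and the LLM is never called
            raise HTTPException(status_code=429, detail="Per-minute request limit reached")
        await cache.async_increment_cache(counter_key, 1, ttl=60)

        # 3. Allow: returning None sends the (possibly modified) data onward
        return None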

Part 3: Implementing Your Plugin - Step-by-Step

Let's walk through the process of creating a simple plugin. We'll create a plugin that:

  1. Checks whether the key's metadata contains a specific access flag (plugin_access: true).
  2. If they have access and are making a "completion" call, it enforces a maximum max_tokens limit.
  3. If they don't have access, it rejects the request with a 403 Forbidden error.

Step 1: Create the Custom Handler File

Create a Python file (e.g., my_plugins.py) in a location where your LiteLLM proxy can import it (typically the directory you run the litellm command from).

# my_plugins.py
from typing import Optional, Literal, Union

from fastapi import HTTPException  # For raising rejections with specific status codes
import litellm  # Needed below for litellm.print_verbose
from litellm.integrations.custom_logger import CustomLogger
from litellm.proxy._types import UserAPIKeyAuth  # Type hint for the auth context
from litellm.caching.dual_cache import DualCache  # Type hint for the shared cache

# Define your plugin class, inheriting from CustomLogger
class AccessAndTokenLimitPlugin(CustomLogger):

    def __init__(self, max_tokens_limit: int = 1024):
        """
        Initialize the plugin, potentially with configuration.
        """
        super().__init__() # Call the base class initializer
        self.max_tokens_limit = max_tokens_limit
        print(f"AccessAndTokenLimitPlugin Initialized with max_tokens={self.max_tokens_limit}")

    # Implement the async_pre_call_hook method
    async def async_pre_call_hook(
        self,
        user_api_key_dict: UserAPIKeyAuth,
        cache: DualCache,
        data: dict,
        call_type: Literal[
            "completion", "text_completion", "embeddings",
            "image_generation", "moderation", "audio_transcription",
            "pass_through_endpoint", "rerank" # Add other relevant types
        ],
    ) -> Optional[Union[dict, str]]: # Return type annotation (raising Exception is also an outcome)
        """
        This hook checks user access via metadata and enforces token limits.
        """
        litellm.print_verbose(f"------ AccessAndTokenLimitPlugin Start ------")
        litellm.print_verbose(f"Call Type: {call_type}")
        litellm.print_verbose(f"User/Key Info: {user_api_key_dict}") # Be careful logging sensitive info in prod

        # --- 1. Access Control ---
        # Check for 'plugin_access: true' in the key's metadata
        key_metadata = user_api_key_dict.metadata or {} # Safely get metadata
        if key_metadata.get("plugin_access") is not True:
            litellm.print_verbose(f"Access Denied: Key metadata missing 'plugin_access: true'. Metadata: {key_metadata}")
            # Reject with 403 Forbidden
            raise HTTPException(
                status_code=403, # Forbidden
                detail="Access Denied: Your API key does not have permission for this operation via the AccessAndTokenLimitPlugin."
            )

        litellm.print_verbose("Access Granted: 'plugin_access: true' found.")

        # --- 2. Enforce Max Tokens (Only for Completion Calls) ---
        if call_type == "completion":
            original_max_tokens = data.get("max_tokens")
            litellm.print_verbose(f"Original max_tokens: {original_max_tokens}")

            if original_max_tokens is None or original_max_tokens > self.max_tokens_limit:
                litellm.print_verbose(f"Enforcing max_tokens limit: Setting to {self.max_tokens_limit}")
                data["max_tokens"] = self.max_tokens_limit # Modify the data dictionary directly
            else:
                 litellm.print_verbose(f"Existing max_tokens ({original_max_tokens}) is within limit ({self.max_tokens_limit}).")

        # --- 3. Allow Request ---
        # If all checks pass, allow the request to proceed.
        # Returning None implicitly allows the request with any modifications made to 'data'.
        litellm.print_verbose(f"------ AccessAndTokenLimitPlugin End (Allowing) ------")
        return None # Or return data

# Create an instance of your plugin handler
# You can pass configuration here if needed
plugin_handler_instance = AccessAndTokenLimitPlugin(max_tokens_limit=512)

Step 2: Configure LiteLLM Proxy (config.yaml)

Modify your proxy config.yaml to tell LiteLLM to use your new plugin handler.

# config.yaml

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
  # Add other models as needed

litellm_settings:
  # Register your plugin instance(s) here
  # Make sure the path (my_plugins.plugin_handler_instance) is correct
  # relative to where you run the `litellm` command.
  callbacks: [my_plugins.plugin_handler_instance]
  # You can add other callbacks too:
  # success_callback: ["langfuse", "my_other_logger.instance"]
  set_verbose: true # Recommended for debugging plugins

general_settings:
  # Add any other general settings
  master_key: sk-1234 # Example, use a real key management strategy

# Example key definitions (if not using DB/UI)
# Ensure keys used for testing have the required metadata
keys:
  - key: sk-plugin-allowed-key
    metadata:
      plugin_access: true # This key *should* pass the plugin check
      user_id: "user-allowed"
  - key: sk-plugin-denied-key
    metadata:
      # plugin_access is missing or false
      user_id: "user-denied"
  - key: sk-no-metadata-key
    user_id: "user-no-meta" # This key will also be denied by the plugin

Step 3: Run the Proxy and Test

  1. Save Files: Ensure my_plugins.py and config.yaml are saved correctly.
  2. Start Proxy: Run the proxy from your terminal, ensuring it can find your files.

    litellm --config config.yaml --logs
    
  3. Test Case 1: Allowed User (Max Tokens Enforced)

    • Use sk-plugin-allowed-key.
    • Make a /chat/completions call requesting more than the plugin's limit (512 in our example).
    curl -X POST http://localhost:4000/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk-plugin-allowed-key" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Tell me a story."}],
        "max_tokens": 1000
      }'
    
    • Expected Outcome: The request should succeed. Check the proxy logs for AccessAndTokenLimitPlugin messages. You should see "Access Granted" and "Enforcing max_tokens limit: Setting to 512". The actual LLM call will use max_tokens: 512.
  4. Test Case 2: Allowed User (Max Tokens Within Limit)

    • Use sk-plugin-allowed-key.
    • Request max_tokens less than or equal to the limit (e.g., 200).
    curl -X POST http://localhost:4000/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk-plugin-allowed-key" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Tell me a joke."}],
        "max_tokens": 200
      }'
    
    • Expected Outcome: The request should succeed. Logs should show "Access Granted" and "Existing max_tokens (200) is within limit (512)". The LLM call will use max_tokens: 200.
  5. Test Case 3: Denied User

    • Use sk-plugin-denied-key or sk-no-metadata-key.
    curl -X POST http://localhost:4000/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer sk-plugin-denied-key" \
      -d '{
        "model": "gpt-3.5-turbo",
        "messages": [{"role": "user", "content": "Will this work?"}]
      }'
    
    • Expected Outcome: The request should fail with an HTTP 403 Forbidden error. The response body should contain the detail message "Access Denied: Your API key does not have permission...". The proxy logs should show "Access Denied: Key metadata missing 'plugin_access: true'". The LLM will not be called.
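
Beyond curl, you can exercise the hook directly in a unit test, without starting the proxy. This is a hedged sketch: it assumes pytest with pytest-asyncio is installed and that UserAPIKeyAuth and DualCache can be constructed with defaults, which holds for recent LiteLLM versions but is worth verifying against the one you deploy.

# test_my_plugins.py - exercising the hook without running the proxy
import pytest
from fastapi import HTTPException
from litellm.proxy._types import UserAPIKeyAuth
from litellm.caching.dual_cache import DualCache

from my_plugins import AccessAndTokenLimitPlugin


@pytest.mark.asyncio
async def test_denied_key_raises_403():
    plugin = AccessAndTokenLimitPlugin(max_tokens_limit=512)
    denied_key = UserAPIKeyAuth(metadata={})  # no plugin_access flag
    with pytest.raises(HTTPException) as exc_info:
        await plugin.async_pre_call_hook(
            user_api_key_dict=denied_key,
            cache=DualCache(),
            data={"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "hi"}]},
            call_type="completion",
        )
    assert exc_info.value.status_code == 403


@pytest.mark.asyncio
async def test_allowed_key_caps_max_tokens():
    plugin = AccessAndTokenLimitPlugin(max_tokens_limit=512)
    allowed_key = UserAPIKeyAuth(metadata={"plugin_access": True})
    data = {"model": "gpt-3.5-turbo", "messages": [], "max_tokens": 1000}
    result = await plugin.async_pre_call_hook(
        user_api_key_dict=allowed_key, cache=DualCache(), data=data, call_type="completion"
    )
    assert result is None             # request allowed
    assert data["max_tokens"] == 512  # limit enforced in place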

Part 4: Common Plugin Patterns & Use Cases

The async_pre_call_hook enables a wide range of powerful plugin functionalities. Here are some common patterns:

1. Dynamic Model Routing:

  • Goal: Select the LLM model based on request parameters, user metadata, or other logic.
  • Logic: Inspect data (e.g., prompt length, specific keywords) or user_api_key_dict (e.g., user tier, budget remaining). Modify data['model'] before returning None.
  • Example Snippet:

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type: Literal[...]):
        if call_type == "completion":
            user_metadata = user_api_key_dict.metadata or {}
            if user_metadata.get("tier") == "premium":
                data['model'] = "gpt-4-turbo" # Route premium users to a better model
            elif len(str(data.get('messages', ''))) > 4000: # Example: check prompt length
                 data['model'] = "claude-3-haiku" # Route long prompts to a model with a large context
            # else: use default model passed in request
        return None
    

2. Input Validation & Sanitization:

  • Goal: Ensure the request payload meets specific criteria (e.g., required parameters, message format, disallowed content) or sanitize input (e.g., PII masking).
  • Logic: Inspect data. If validation fails, raise HTTPException(status_code=400, detail="Validation failed..."). For sanitization, modify data (e.g., data['messages']) in place.
  • Example Snippet (PII Masking Concept):

    async def async_pre_call_hook(self, ..., data: dict, call_type: Literal[...]):
        if call_type == "completion" and "messages" in data:
            for message in data['messages']:
                if message.get('role') == 'user':
                    # Replace this with actual PII detection/masking logic
                    if "email:" in message.get('content', ''):
                       # Caution: Simple string replacement is naive. Use proper libraries.
                       message['content'] = message['content'].replace("email:", "masked_email:")
                       litellm.print_verbose("Masked potential email in user message.")
        return None
    

    (Note: Proper PII masking requires robust libraries like presidio or custom logic).

3. Request Enrichment:

  • Goal: Automatically add parameters to the request based on user context or global settings.
  • Logic: Modify the data dictionary to add or update keys like temperature, max_tokens, or inject specific system prompts or metadata.
  • Example Snippet:

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type: Literal[...]):
        if call_type == "completion":
            # Ensure a user ID is always passed if available from the key
            if data.get("user") is None and user_api_key_dict.user_id:
                data["user"] = user_api_key_dict.user_id
    
            # Add default safety settings if not provided
            if data.get("safety_settings") is None:
                 data["safety_settings"] = {"block_hate_speech": True} # Example
        return None
    

4. Policy Enforcement (Budget, Permissions):

  • Goal: Reject requests that violate budget constraints or permission rules defined in key/team/user metadata.
  • Logic: Read budget/spend info or custom permission flags from user_api_key_dict. If a policy is violated, raise HTTPException (e.g., 403 Forbidden for permissions, 429 Too Many Requests for budget/rate limits).
  • Example Snippet (Simple Budget Check):

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type: Literal[...]):
        key_spend = user_api_key_dict.spend or 0.0
        key_max_budget = user_api_key_dict.max_budget
    
        if key_max_budget is not None and key_spend >= key_max_budget:
            raise HTTPException(
                status_code=429, # Use 429 for budget/rate limits
                detail=f"API Key budget limit exceeded. Spend: {key_spend}, Budget: {key_max_budget}"
            )
        # Add checks for team budget, user budget etc. if needed
        return None
    

    (Note: The built-in rate limiter is more sophisticated, using the cache for real-time checks. This is a simplified pre-check).

5. Custom Rejection Logic:

  • Goal: Reject requests based on specific content triggers or business rules, potentially providing a custom message back to the user.
  • Logic: Inspect data. If rejection criteria met:
    • For chat/completions: return "Your request was rejected because..."
    • For other types or more specific errors: raise HTTPException(...)
  • Example Snippet (String Return):

    from litellm.utils import get_formatted_prompt # Helper to get text
    
    async def async_pre_call_hook(self, ..., data: dict, call_type: Literal[...]):
        if call_type == "completion":
            try:
                prompt_text = get_formatted_prompt(data=data, call_type=call_type)
                if "forbidden phrase" in prompt_text.lower():
                    return "Request rejected due to containing a forbidden phrase." # LiteLLM formats this
            except Exception as e:
                 litellm.print_verbose(f"Error getting formatted prompt: {e}")
        return None
    

6. Cost Estimation & Prevention:

  • Goal: Estimate the potential cost of a request (e.g., based on prompt tokens for completions) and reject it if it exceeds a threshold defined in metadata.
  • Logic: Use litellm.token_counter on data['messages'] (it runs synchronously, so be mindful of the latency it adds for very large message lists). Look up per-token pricing via litellm.model_cost to estimate the prompt cost, compare it against a max_request_cost value from user_api_key_dict.metadata, and raise HTTPException(status_code=400) if it is too high.
  • Requires: Careful implementation to avoid adding noticeable latency; best reserved for genuinely high-cost scenarios.
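  • Example Snippet (a hedged sketch: as with the snippets above, this is the body of an async_pre_call_hook on a CustomLogger subclass; the max_request_cost metadata key is an illustrative convention, while litellm.token_counter and litellm.model_cost are the library helpers named above):

    from fastapi import HTTPException
    import litellm

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type):
        if call_type != "completion":
            return None

        # Per-request cost ceiling stored in key metadata (illustrative key name)
        max_request_cost = (user_api_key_dict.metadata or {}).get("max_request_cost")
        if max_request_cost is None:
            return None

        model = data.get("model", "")
        prompt_tokens = litellm.token_counter(model=model, messages=data.get("messages", []))
        cost_per_token = litellm.model_cost.get(model, {}).get("input_cost_per_token", 0.0)
        estimated_cost = prompt_tokens * cost_per_token

        if estimated_cost > max_request_cost:
            raise HTTPException(
                status_code=400,
                detail=f"Estimated prompt cost ${estimated_cost:.6f} exceeds per-request limit ${max_request_cost}",
            )
        return None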

7. A/B Testing Setup:

  • Goal: Route a percentage of requests for a specific model to an alternative model or add specific metadata for tracking A/B tests.
  • Logic: Use random.random() or user ID hashing to determine if a request falls into the test group. Modify data['model'] or add flags to data['metadata'] (e.g., data['metadata']['ab_test_group'] = 'B'). Use the cache if you need sticky routing for a user session.
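  • Example Snippet (a hedged sketch of deterministic bucketing; the candidate model, the 10% split, and the ab_test_group metadata key are illustrative choices):

    import hashlib

    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type):
        if call_type != "completion" or data.get("model") != "gpt-3.5-turbo":
            return None

        # Deterministic bucketing: the same end user always lands in the same group
        bucket_source = data.get("user") or user_api_key_dict.user_id or ""
        bucket = int(hashlib.sha256(bucket_source.encode()).hexdigest(), 16) % 100

        if bucket < 10:  # send 10% of traffic to the candidate model
            data["model"] = "gpt-4o-mini"  # illustrative alternative model
            data.setdefault("metadata", {})["ab_test_group"] = "B"
        else:
            data.setdefault("metadata", {})["ab_test_group"] = "A"
        return None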

Part 5: Interaction with Other Hooks (Coordination is Key)

While async_pre_call_hook is powerful, it often works best in coordination with other hooks, especially for stateful plugins:

  • async_log_success_event / async_log_failure_event: Essential for updating state based on the outcome of the LLM call. The parallel_request_limiter uses these to decrement the request count and update TPM correctly after the call finishes. If your pre-call hook optimistically increments a counter or sets a flag, you'll likely need these post-call hooks to adjust or clear that state based on success or failure.
  • async_post_call_success_hook: Useful if your plugin needs to modify the response based on actions taken in the pre-call hook or based on final state. The rate limiter uses it to add X-RateLimit headers reflecting the final counts.
  • async_moderation_hook: Runs in parallel to the main LLM call. If your plugin involves a potentially slow check (like complex content moderation) that shouldn't block the main request, consider using this hook instead of async_pre_call_hook. Be aware that if async_moderation_hook raises an exception, it might override the actual LLM response.

Stateful Plugin Coordination Example (Conceptual Rate Limiter):

  1. async_pre_call_hook:
    • Read limits from user_api_key_dict.
    • Read current count from cache.
    • If count + 1 > limit, raise HTTPException(429).
    • If allowed, await cache.async_increment_cache(key, 1).
    • Return None.
  2. async_log_success_event:
    • await cache.async_increment_cache(key, -1) (decrement count).
    • Update TPM based on response_obj.usage.
  3. async_log_failure_event:
    • If exception is not the 429 raised by the pre-call hook:
      • await cache.async_increment_cache(key, -1) (decrement count).
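
A compact sketch of that coordination follows. The cache key format, the fallback limit of 10, and the way the cache handle and hashed key are recovered in the post-call hooks are simplifying assumptions; the built-in parallel_request_limiter.py handles these details (and many more edge cases) properly.

from typing import Literal, Optional, Union

from fastapi import HTTPException
from litellm.integrations.custom_logger import CustomLogger
from litellm.proxy._types import UserAPIKeyAuth
from litellm.caching.dual_cache import DualCache


class ConcurrencyLimitPlugin(CustomLogger):
    def __init__(self):
        super().__init__()
        self.internal_cache: Optional[DualCache] = None  # simplification: set on first pre-call

    def _counter_key(self, hashed_key: str) -> str:
        return f"concurrency:{hashed_key}"  # illustrative key format

    async def async_pre_call_hook(
        self,
        user_api_key_dict: UserAPIKeyAuth,
        cache: DualCache,
        data: dict,
        call_type: Literal["completion", "text_completion", "embeddings"],
    ) -> Optional[Union[Exception, str, dict]]:
        self.internal_cache = cache
        limit = user_api_key_dict.max_parallel_requests or 10
        key = self._counter_key(user_api_key_dict.token)
        in_flight = await cache.async_get_cache(key) or 0
        if in_flight + 1 > limit:
            raise HTTPException(status_code=429, detail="Max parallel requests reached")
        await cache.async_increment_cache(key, 1, ttl=120)
        return None

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Assumption: in proxy mode the hashed key is available in the call metadata.
        metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
        hashed_key = metadata.get("user_api_key")
        if hashed_key and self.internal_cache is not None:
            await self.internal_cache.async_increment_cache(self._counter_key(hashed_key), -1, ttl=120)

    async def async_log_failure_event(self, kwargs, response_obj, start_time, end_time):
        # A production version should skip this decrement when the failure is the
        # limiter's own 429, since nothing was incremented in that case (step 3 above).
        metadata = (kwargs.get("litellm_params") or {}).get("metadata") or {}
        hashed_key = metadata.get("user_api_key")
        if hashed_key and self.internal_cache is not None:
            await self.internal_cache.async_increment_cache(self._counter_key(hashed_key), -1, ttl=120)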

Part 6: Advanced Techniques

  • Complex State with cache: Store more than just simple counters. Cache dictionaries or JSON strings representing user sessions, multi-step process states, or recent activity timestamps. Remember to use appropriate TTLs.
  • External API Calls: Your hook can await calls to external services (e.g., a policy decision point, a feature flag service, a PII detection API). Use extreme caution:
    • Keep these calls fast and reliable. A slow external call will add latency to every single request.
    • Implement robust error handling and timeouts for external calls. Don't let a failing external service bring down your proxy.
    • Consider caching results from external calls using the DualCache (see the sketch after this list).
  • Complex Data Manipulation: For tasks like complex prompt templating or data transformation, you might call helper functions or classes defined elsewhere in your plugin file or imported modules. Keep the hook logic itself clean and focused on orchestration.
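
As an illustration of the external-call guidance above, here is a hedged sketch of a cached, timeout-guarded policy lookup. The policy service URL and its response shape are hypothetical, and fail-open is a deliberate (and debatable) default; httpx is used here because it is available in typical LiteLLM installs, but any async HTTP client works.

import httpx
from litellm.caching.dual_cache import DualCache

POLICY_URL = "https://policy.internal.example/check"  # hypothetical endpoint
POLICY_CACHE_TTL = 300  # reuse a decision for 5 minutes


async def is_allowed_by_policy(cache: DualCache, user_id: str) -> bool:
    cache_key = f"policy_decision:{user_id}"
    cached = await cache.async_get_cache(cache_key)
    if cached is not None:
        return cached

    try:
        async with httpx.AsyncClient(timeout=1.0) as client:  # keep the timeout tight
            resp = await client.post(POLICY_URL, json={"user_id": user_id})
            allowed = resp.status_code == 200 and resp.json().get("allowed", False)
    except Exception:
        # Fail-open: if the policy service is slow or down, let traffic through.
        # Swap this for raising HTTPException(status_code=503) if you need fail-closed.
        allowed = True

    await cache.async_set_cache(cache_key, allowed, ttl=POLICY_CACHE_TTL)
    return allowed

Inside async_pre_call_hook you would then call allowed = await is_allowed_by_policy(cache, user_api_key_dict.user_id or "") and raise HTTPException(status_code=403) when it returns False.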

Part 7: Best Practices and Pitfalls

  • Performance is Paramount: Code within async_pre_call_hook adds latency before the LLM call. Keep it lean and fast. Avoid synchronous I/O, complex computations, or inefficient loops. Profile if necessary. Use asyncio.create_task for non-critical background updates (like cache writes in the rate limiter).
  • Robust Error Handling: Wrap your hook logic in try...except. An unhandled exception in your hook can potentially block all requests passing through the proxy. Log errors clearly within your hook for debugging (litellm.print_verbose or a proper logger). Consider what should happen if your hook encounters an unexpected error – should it reject the request (raise HTTPException(500)) or allow it to proceed (fail-open)? A sketch of both options follows this list.
  • Security Consciousness: Be extremely careful when modifying the data dictionary, especially data['messages'] or parameters that affect model behavior. Sanitize any external input used by your plugin. Avoid introducing prompt injection vulnerabilities.
  • Idempotency: While typically run once per request attempt, consider if interactions with the cache or external systems need to be idempotent (safe to run multiple times with the same outcome) if retries could potentially re-trigger the hook (though LiteLLM's retry logic usually happens after the initial hook execution fails).
  • Clear Modifications: Make it obvious what your plugin is doing to the data dictionary. Add comments. Ensure the modifications result in a valid payload for the target LLM.
  • Targeted Logic (call_type): Use call_type checks to ensure your plugin only affects the intended LLM operations.
  • Configuration: Avoid hardcoding values. Use the __init__ method of your handler class to accept configuration (like the max_tokens_limit in our example) or design your plugin to read settings from the user_api_key_dict.metadata.
  • Testing: Test your plugin thoroughly with various inputs, edge cases, valid keys, invalid keys, different call_types, and expected failure scenarios.
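
To make the error-handling advice concrete, here is a minimal sketch of the fail-open vs. fail-closed decision (as with the snippets in Part 4, this is the body of the hook method); which branch you keep is a policy choice for your deployment, not something LiteLLM dictates.

from fastapi import HTTPException
import litellm


async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type):
    try:
        # ... your actual plugin checks and modifications go here ...
        return None
    except HTTPException:
        # Intentional rejections (403, 429, ...) must propagate untouched.
        raise
    except Exception as e:
        litellm.print_verbose(f"MyPlugin hit an unexpected error: {e}")
        # Fail-open: allow the request despite the plugin error.
        return None
        # Fail-closed alternative:
        # raise HTTPException(status_code=500, detail="Plugin error, request blocked")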

Part 8: Debugging Your Plugin

Debugging hooks can be tricky. Here are some strategies:

  1. Verbose Logging: Enable litellm_settings.set_verbose: true in your config.yaml. Use litellm.print_verbose(...) liberally inside your hook to print variable values, execution steps, and decisions.
  2. Standard Logging: Implement a proper Python logger within your plugin class for more structured logging (see the sketch after this list).
  3. Print Statements: Good old print() statements can be helpful during initial development (remember to remove them later).
  4. Isolate the Hook: Temporarily comment out other callbacks or complex logic in config.yaml to isolate the behavior of your specific async_pre_call_hook.
  5. Test Specific Inputs: Craft curl requests or use API clients (like Postman, Insomnia) to send specific payloads that trigger different paths within your hook logic.
  6. Check UserAPIKeyAuth: Print the entire user_api_key_dict at the start of your hook during debugging to confirm you're receiving the expected context, limits, and metadata.
  7. Check data: Print the data dictionary before and after modifications to verify changes.
  8. Python Debugger: If running locally, you can use pdb or your IDE's debugger. Set a breakpoint (import pdb; pdb.set_trace()) inside your hook method. You'll need to attach the debugger to the running litellm process.
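
For strategy 2, here is a small sketch of a module-level logger in my_plugins.py, shown on an abbreviated version of the Part 3 class; the logger name and the fields logged are arbitrary choices.

# my_plugins.py
import logging

from litellm.integrations.custom_logger import CustomLogger

logger = logging.getLogger("my_plugins")


class AccessAndTokenLimitPlugin(CustomLogger):
    async def async_pre_call_hook(self, user_api_key_dict, cache, data: dict, call_type):
        logger.debug("hook start: call_type=%s model=%s", call_type, data.get("model"))
        logger.debug("key metadata: %s", user_api_key_dict.metadata)
        # ... checks and modifications ...
        logger.debug("hook end: max_tokens=%s", data.get("max_tokens"))
        return None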

Conclusion: Build, Extend, Control

The async_pre_call_hook is the cornerstone of building impactful plugins for the LiteLLM Proxy. It transforms the proxy from a simple request forwarder into a highly customizable, intelligent layer capable of enforcing complex policies, dynamically adapting requests, and integrating bespoke logic directly into the LLM workflow.

By mastering its signature, understanding the contextual data it provides (UserAPIKeyAuth, cache, enriched data), and carefully controlling its output, you can build plugins that enhance security, manage costs effectively, improve user experience, and tailor the LiteLLM Proxy precisely to your application's needs.

Start simple, test thoroughly, adhere to best practices, and leverage the full power of this hook to unlock the next level of customization for your AI gateway. Happy plugging!