Prompt Caching in AetherMind

Prompt caching is a performance feature that enables AetherMind to deliver faster responses for requests whose prompts share a common prefix. In many cases, this can reduce the time to first token (TTFT) by up to 80%, significantly improving efficiency and user experience.

Prompt caching is enabled by default for all AetherMind models and deployments. For dedicated deployments, this feature not only accelerates response times but also frees up resources, allowing for higher throughput on the same hardware. Additionally, Enterprise plan users gain access to advanced configuration options to further optimize cache performance.


Using Prompt Caching

Common Use Cases

Requests sent to large language models (LLMs) often include repetitive or shared portions of prompts. Examples include:

  • Long system prompts: Detailed instructions or guidelines.

  • Tool descriptions: Information about available tools for function calling.

  • Conversation history: Growing dialogues in chat-based applications.

  • Shared context: User-specific data, such as a current file for a coding assistant.

By caching these shared prefixes, AetherMind avoids reprocessing them, enabling faster output generation.


Structuring Prompts for Caching

To maximize the benefits of prompt caching:

  • Place static content (e.g., instructions, examples) at the beginning of the prompt.

  • Add variable content (e.g., user-specific inputs) at the end.
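As a hedged sketch of this layout (the system prompt, examples, and user input below are illustrative, not from this document), static content goes first so it forms a stable, cacheable prefix, and per-request content goes last:

```python
# Cache-friendly request layout: identical static prefix, variable suffix.
SYSTEM_PROMPT = "You are a meticulous coding assistant."  # static across requests
FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Example question"},
    {"role": "assistant", "content": "Example answer"},
]

def build_messages(user_input: str) -> list:
    """Static content first (cacheable prefix), variable content last."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *FEW_SHOT_EXAMPLES,
        {"role": "user", "content": user_input},  # varies per request
    ]

msgs = build_messages("Why does my loop never terminate?")
```

Because every request starts with the same system prompt and examples, that prefix can be served from the cache; only the final user message is new work.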

Important Notes:

  • For function-calling models, tool descriptions are treated as part of the prompt.

  • For vision-language models, images are currently not cached, though this may change in future updates.


How It Works

AetherMind automatically identifies the longest prefix of a request that exists in the cache and reuses it. The remaining portion of the prompt is processed as usual.

  • Cached prompts are stored for future reuse, typically remaining in the cache for several minutes to hours, depending on factors like model type, load levels, and deployment configuration.

  • Older prompts are evicted from the cache first to make room for new ones.

Prompt caching does not alter the model's output: each generation is still sampled independently, so results are distributed identically whether or not caching is used.


Monitoring Cache Performance

For dedicated deployments, information about prompt caching is provided in the response headers:

  • dashflow-prompt-tokens: Total number of tokens in the prompt.

  • dashflow-cached-prompt-tokens: Number of tokens reused from the cache.

Aggregated metrics are also accessible via the usage dashboard, offering insights into cache utilization and performance.
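A minimal sketch of using these two headers, assuming `headers` stands in for a real HTTP response's header map, is to compute a per-request cache hit rate:

```python
# Compute the fraction of prompt tokens served from the cache,
# using the two headers described above.
def cache_hit_rate(headers: dict) -> float:
    total = int(headers.get("dashflow-prompt-tokens", 0))
    cached = int(headers.get("dashflow-cached-prompt-tokens", 0))
    return cached / total if total else 0.0

headers = {"dashflow-prompt-tokens": "1200", "dashflow-cached-prompt-tokens": "900"}
rate = cache_hit_rate(headers)  # 0.75
```

Tracking this ratio over time is a quick way to verify that prompt restructuring (static prefix first) is actually improving reuse.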


Data Privacy

  • Serverless Deployments : Separate caches are maintained for each AetherMind account to prevent data leakage and mitigate timing attacks.

  • Dedicated Deployments : By default, a single cache is shared across all requests. While this does not affect outputs, there is a minor risk of timing attacks where an adversary could infer whether a prompt is cached based on response times.

To ensure complete isolation, you can use:

  • The x-prompt-cache-isolation-key header.

  • The prompt_cache_isolation_key field in the request body.

These accept an arbitrary string that acts as an additional cache key, ensuring no sharing occurs between requests with different keys.
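For example, a multi-tenant application might derive one key per tenant so that tenants never share cache entries (the field name comes from this document; the hashing scheme and request shape are illustrative assumptions):

```python
import hashlib

# Sketch: per-tenant cache isolation via prompt_cache_isolation_key.
def build_request(prompt: str, tenant_id: str) -> dict:
    # Hash the tenant id into an opaque key (illustrative choice; any
    # stable, distinct string per tenant works).
    key = hashlib.sha256(tenant_id.encode()).hexdigest()
    return {
        "prompt": prompt,
        # Requests with different keys never share cache entries.
        "prompt_cache_isolation_key": key,
    }

a = build_request("hello", "tenant-a")
b = build_request("hello", "tenant-b")
# Same prompt, different keys: no cross-tenant cache sharing.
```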


Limiting or Disabling Caching

If needed, you can control caching behavior using the prompt_cache_max_len field in the request body:

  • Set "prompt_cache_max_len": 0 to disable caching entirely.

  • Limit the maximum prefix length (in tokens) considered for caching.

This is rarely necessary in real-world applications but can be useful for benchmarking dedicated deployments.
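A benchmarking harness might therefore disable caching to measure cold-start latency (the field name comes from this document; the surrounding request shape is an illustrative assumption):

```python
# Sketch: request body for an uncached benchmarking run.
def benchmark_body(prompt: str, max_cache_len: int = 0) -> dict:
    return {
        "prompt": prompt,
        # 0 disables prompt caching entirely; a positive value caps the
        # prefix length (in tokens) considered for caching.
        "prompt_cache_max_len": max_cache_len,
    }

body = benchmark_body("measure cold-start latency")
```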


Advanced: Cache Locality for Enterprise Deployments

Enterprise users can enhance cache performance by leveraging session affinity.

  1. Enable Session Affinity : Create or update your deployment with the following flag:

    firectl create deployment ... --enable-session-affinity

  2. Pass a Session Identifier : Include an opaque identifier (e.g., a user ID or session ID) in the user field of the request body or in the x-session-affinity header. AetherMind will route requests with the same identifier to the same server, improving cache hit rates and reducing response times.

Best practices for choosing an identifier:

  • Group requests with long shared prompt prefixes, such as a chat session with the same user or an assistant working within the same shared context.
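A hedged sketch of step 2, assuming a simple dict-based request body (the header and field names come from this document; the helper itself is illustrative):

```python
# Sketch: attach a session identifier via both supported channels so
# requests from the same session are routed to the same server.
def with_affinity(body: dict, session_id: str):
    headers = {"x-session-affinity": session_id}
    # Either the header or the `user` field works; this sets both.
    body = {**body, "user": session_id}
    return body, headers

body, headers = with_affinity({"prompt": "next turn of the chat"}, "user-42-chat-7")
```

An opaque, stable identifier per chat session (or per shared working context) keeps long shared prefixes landing on the same server's cache.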
