Inference: Querying Text Models in AetherMind

AetherMind provides a powerful, OpenAI-compatible REST API for querying text models. Users can interact with it in multiple ways:

  • AetherMind Python Client Library – Seamless integration with Python-based applications.

  • Web Console – A user-friendly interface for testing models interactively.

  • LangChain – Connect models to retrieval- and knowledge-based applications.

  • Direct API Invocation – Use your preferred programming language or HTTP tooling (a minimal sketch follows this list).

  • OpenAI Python Client – Compatible with existing OpenAI-based implementations.
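
Because the API is OpenAI-compatible, direct invocation needs nothing more than an HTTP client. The sketch below uses Python's requests library; the base URL and bearer-token header are assumptions for illustration, so substitute the endpoint from your account documentation.

python

import requests

# Hypothetical base URL; replace with the endpoint for your account.
BASE_URL = "https://api.aethermind.ai/inference/v1"

response = requests.post(
  f"{BASE_URL}/chat/completions",
  headers={"Authorization": "Bearer <API_KEY>"},
  json={
    "model": "accounts/dashflow/models/llama-v3-8b-instruct",
    "messages": [{"role": "user", "content": "Say this is a test"}],
  },
)
print(response.json()["choices"][0]["message"]["content"])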

Using the Web Console

All AetherMind models are accessible via the web console at AetherMind.ai. In the playground, users can select a model, enter a prompt, and adjust additional request parameters.

  • Non-chat models use the completions API, which sends the prompt to the model as-is.

  • Chat models (instruct models) use the chat completions API, which formats input according to the model's conversation style.

  • Advanced users can revert to the completions API by disabling the "Use chat template" option.

Using the API

Chat Completions API

Models with a conversation configuration support the chat completions API, which accepts a list of messages and formats them according to the model's conversation style. For example:

python

from dashflow.client import DashFlow

# Create a client authenticated with your API key.
client = DashFlow(api_key="<DASHFLOW_API_KEY>")

# Send a single-turn conversation; the model's chat template
# is applied automatically.
response = client.chat.completions.create(
  model="accounts/dashflow/models/llama-v3-8b-instruct",
  messages=[{
    "role": "user",
    "content": "Say this is a test",
  }],
)
print(response.choices[0].message.content)

Overriding the System Prompt

Some conversation styles include a default system prompt (e.g., "You are a helpful AI."). Users can override it by passing a message with the "system" role first in the list:

python

[  
  {  
    "role": "system",  
    "content": "You are a pirate."  
  },  
  {  
    "role": "user",  
    "content": "Hello, what is your name?"  
  }  
]  

To remove the system prompt, set the content field to an empty string.
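
Putting this together, a complete request that overrides the system prompt might look like the following sketch, which reuses the chat completions call shown earlier:

python

from dashflow.client import DashFlow

client = DashFlow(api_key="<DASHFLOW_API_KEY>")

# The leading "system" message replaces the default system prompt.
response = client.chat.completions.create(
  model="accounts/dashflow/models/llama-v3-8b-instruct",
  messages=[
    {"role": "system", "content": "You are a pirate."},
    {"role": "user", "content": "Hello, what is your name?"},
  ],
)
print(response.choices[0].message.content)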

Completions API

Text models generate a continuation of the given prompt. The model keeps generating until it reaches the maximum number of output tokens or emits a special end-of-sequence (EOS) token.

python

from dashflow.client import DashFlow

client = DashFlow(api_key="<DASHFLOW_API_KEY>")

# Send the raw prompt; no chat template is applied.
response = client.completion.create(
  model="accounts/dashflow/models/llama-v3-8b-instruct",
  prompt="Say this is a test",
)
print(response.choices[0].text)

Advanced Options

  • Streaming – Stream results in real-time for chat applications.

  • Async Mode – Issue requests asynchronously for higher throughput.

  • Multiple Choices – Generate multiple output variations per request.

  • Max Tokens – Define the maximum number of tokens per response.

  • Temperature – Control response randomness (higher = more creativity).

  • Top-p & Top-k Sampling – Alternative sampling methods for response variety.

  • Repetition Penalty – Reduce redundant outputs and looping behavior.

  • Mirostat Algorithm – Adaptive sampling that holds output perplexity near a target value.

  • Logit Bias – Adjust the likelihood of certain words appearing.
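
As an illustration, the sketch below streams a response while combining several of these options. The parameter names (max_tokens, temperature, top_p, stream) follow the OpenAI-compatible convention and are assumptions about this client.

python

from dashflow.client import DashFlow

client = DashFlow(api_key="<DASHFLOW_API_KEY>")

# Stream tokens as they are generated, with explicit sampling controls.
stream = client.chat.completions.create(
  model="accounts/dashflow/models/llama-v3-8b-instruct",
  messages=[{"role": "user", "content": "Write a haiku about the sea."}],
  max_tokens=128,   # cap the response length
  temperature=0.8,  # higher values increase randomness
  top_p=0.95,       # nucleus sampling threshold
  stream=True,      # yield chunks incrementally
)
for chunk in stream:
  print(chunk.choices[0].delta.content or "", end="")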

Debugging Options

  • Ignore EOS – Prevents the model from stopping at an end-of-sequence token.

  • Logprobs – Returns token probabilities for debugging.

  • Echo – Displays input and output together for verification.

  • Raw Output – Returns the full, unprocessed model response for debugging.
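
For example, the completions API can return token log probabilities and echo the prompt for inspection. The logprobs and echo parameters below follow the OpenAI-compatible convention and are assumptions about this client.

python

from dashflow.client import DashFlow

client = DashFlow(api_key="<DASHFLOW_API_KEY>")

# Request token-level log probabilities and echo the prompt
# back alongside the completion.
response = client.completion.create(
  model="accounts/dashflow/models/llama-v3-8b-instruct",
  prompt="Say this is a test",
  logprobs=1,  # include log probabilities for top tokens
  echo=True,   # prepend the prompt to the returned text
)
print(response.choices[0].text)
print(response.choices[0].logprobs)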

Tokenization in AetherMind

AetherMind models process text in tokens, where token count affects cost and response length. The actual number of tokens used in a request is returned in the usage field of the API response.
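
For example, assuming the OpenAI-compatible usage schema, the token counts for any of the requests above can be read like this:

python

# Token accounting returned with each response (field names assume
# the OpenAI-compatible usage schema).
print(response.usage.prompt_tokens)      # tokens in the input
print(response.usage.completion_tokens)  # tokens generated
print(response.usage.total_tokens)       # sum of the two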

For more details, visit the AetherMind API documentation at AetherMind.ai/docs.
