Querying Vision-Language Models in AetherMind

Using the API

AetherMind supports both the completions API and the chat completions API. However, we highly recommend using the chat completions API to avoid common formatting issues, such as misplaced whitespace, which can negatively impact model performance.

For Llama 3.2 Vision models, images must be included before text in the content field to ensure proper processing. Images can be provided as URLs or as base64-encoded data. Examples of both methods are shown below.


Chat Completions API

AetherMind vision-language models require a conversational setup and must use the chat completions API. These models are optimized for specific conversational formats. For example, Phi-3 models follow this structure:

SYSTEM: {system message}  
USER: <image>  
{user message}  
ASSISTANT:  

The <image> placeholder is a special token that indicates where the image should be referenced.
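As an illustration, the template above can be assembled in a few lines. The format_prompt helper below is hypothetical; when you use the chat completions API, this assembly is performed for you server-side:

```python
# Illustrative only: how a chat request maps onto the Phi-3 prompt
# template shown above. This helper is not part of the AetherMind SDK.
def format_prompt(system_message, user_message, num_images=1):
    """Assemble a Phi-3-style prompt with <image> placeholders."""
    image_tokens = "<image>\n" * num_images
    return (
        f"SYSTEM: {system_message}\n"
        f"USER: {image_tokens}"
        f"{user_message}\n"
        f"ASSISTANT:"
    )

prompt = format_prompt("You are a helpful assistant.", "Describe this image.")
print(prompt)
```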

Example: Sending an Image via URL

import dashflow.client  

dashflow.client.api_key = "<DASHFLOW_API_KEY>"  

response = dashflow.client.ChatCompletion.create(  
  model="dashflow/models/phi-3-vision-128k-instruct",  
  messages=[{  
    "role": "user",  
    "content": [  
      {"type": "text", "text": "Can you describe this image?"},  
      {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}  
    ]  
  }]  
)  

print(response.choices[0].message.content)  

Example: Sending an Image as Base64

import dashflow.client  
import base64  

# Read the image file and encode it as base64  
def encode_image(image_path):  
  with open(image_path, "rb") as image_file:  
    return base64.b64encode(image_file.read()).decode('utf-8')  

image_base64 = encode_image("your_image.jpg")  

dashflow.client.api_key = "<DASHFLOW_API_KEY>"  

response = dashflow.client.ChatCompletion.create(  
  model="dashflow/models/phi-3-vision-128k-instruct",  
  messages=[{  
    "role": "user",  
    "content": [  
      {"type": "text", "text": "Can you describe this image?"},  
      {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}}  
    ]  
  }]  
)  

print(response.choices[0].message.content)  

Completions API

Advanced users can query the completions API directly. The <image> token must be manually inserted into the prompt, and images must be provided as an ordered list.

Example: Using Completions API

import dashflow.client  
import base64  

# Encode the image  
def encode_image(image_path):  
  with open(image_path, "rb") as image_file:  
    return base64.b64encode(image_file.read()).decode('utf-8')  

image_base64 = encode_image("your_image.jpg")  

dashflow.client.api_key = "<DASHFLOW_API_KEY>"  

# The <image> token is inserted manually into the prompt, and images are  
# passed as an ordered list matching the order of the <image> tokens.  
response = dashflow.client.Completion.create(  
  model="dashflow/models/phi-3-vision-128k-instruct",  
  prompt="SYSTEM: You are a helpful assistant.\nUSER: <image>\nCan you describe this image?\nASSISTANT:",  
  images=[f"data:image/jpeg;base64,{image_base64}"]  
)  

print(response.choices[0].text)  

API Limitations

  • A single request cannot include more than 30 images.

  • Each image must be smaller than 5MB.

  • Each image must finish downloading within 1.5 seconds, or the request will fail.
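These limits can be checked client-side so an oversized request fails fast instead of round-tripping to the server. A minimal sketch; the constants reflect the limits above, but the helper itself is not part of the AetherMind SDK:

```python
# Client-side validation against the request limits listed above.
# The limits come from this page; the helper is illustrative only.
MAX_IMAGES = 30
MAX_IMAGE_BYTES = 5 * 1024 * 1024  # each image must be under 5MB

def validate_images(image_blobs):
    """Raise ValueError if the image list would exceed the API limits."""
    if len(image_blobs) > MAX_IMAGES:
        raise ValueError(f"Too many images: {len(image_blobs)} > {MAX_IMAGES}")
    for i, blob in enumerate(image_blobs):
        if len(blob) >= MAX_IMAGE_BYTES:
            raise ValueError(f"Image {i} is {len(blob)} bytes; limit is {MAX_IMAGE_BYTES}")

validate_images([b"\x89PNG small test payload"])  # passes: one small image
```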


Model Limitations

Currently, AetherMind supports Phi-3 Vision models for serverless deployment.


Managing Images

  • The Chat Completions API is stateless, meaning you must manage your own messages and images.

  • Images are cached to improve performance.

  • For long conversations, using image URLs instead of base64 reduces latency.
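Because the API is stateless, the client resends the full message history on every turn. A minimal sketch of history management; the message shapes match the examples above, but the helper itself is not part of the AetherMind SDK:

```python
# Stateless conversations: the client keeps the history and resends it
# with every request. This helper is illustrative only.
messages = []

def add_user_turn(text, image_url=None):
    """Append a user message, placing the image before the text."""
    content = []
    if image_url:
        # URL references keep the resent history small compared with
        # embedding base64 payloads in every request.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    content.append({"type": "text", "text": text})
    messages.append({"role": "user", "content": content})

add_user_turn("Can you describe this image?", "https://example.com/image.jpg")
# After each API call, append the assistant reply so the next request
# carries the full conversation context:
messages.append({"role": "assistant", "content": "A photo of a sunset."})
```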


Cost Calculation

  • Each image is treated as a dynamic number of tokens based on resolution.

  • Typically, 1 image = 1K to 2.5K tokens.

  • Pricing follows the standard AetherMind text model rates (see pricing page for details).
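As a rough worked example, a request with four images at the upper end of that range consumes about 10K tokens. The per-token rate below is a placeholder, not an actual AetherMind price; see the pricing page for real numbers:

```python
# Back-of-the-envelope cost estimate using the 1K-2.5K tokens-per-image
# range above. PRICE_PER_1K_TOKENS is hypothetical, not a real rate.
PRICE_PER_1K_TOKENS = 0.0002  # placeholder rate in USD

def estimate_image_cost(num_images, tokens_per_image=2500):
    """Upper-bound token count and cost for the images in one request."""
    tokens = num_images * tokens_per_image
    return tokens, tokens / 1000 * PRICE_PER_1K_TOKENS

tokens, cost = estimate_image_cost(4)
print(tokens, cost)  # 10000 tokens at the assumed upper bound
```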


FAQ

Can I fine-tune the image capabilities of VLMs? Not yet, but AetherMind is actively working on fine-tuning capabilities for Phi-3 Vision models.

Can AetherMind generate images? No. However, AetherMind supports image generation models, including:

  • Stable Diffusion

  • SSD-1B

  • Japanese Stable Diffusion

  • Playground v2

What image formats are supported? .png, .jpg, .jpeg, .gif, .bmp, .tiff, and .ppm

What is the maximum image size?

  • Base64 images must be under 10MB (after encoding).

  • Images from URLs must be under 5MB.

What happens to uploaded images? Images are not persisted beyond the server's active session.

How do rate limits work? AetherMind applies standard rate limits based on your API tier (see Pricing).

Can AetherMind models understand image metadata? No, image metadata is not processed. If needed, include metadata as text in the prompt.
