Vision API

The Vision API exposes three image analysis endpoints: description, OCR, and object detection. All endpoints accept either a URL or a base64 data URI and return structured JSON.

Common Input Formats

| Parameter | Type | Description |
| --- | --- | --- |
| image | string | URL to an image (HTTPS) |
| image_base64 | string | Base64-encoded image (data URI) |

Supported image formats: JPG, PNG, WEBP — max 20 MB. Exactly one of image or image_base64 must be provided.
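The exactly-one-of rule above can be enforced client-side before a request ever leaves your machine. The helper below is an illustrative sketch (not an official SDK): it builds either the `image` or the `image_base64` field, encoding a local file as a data URI in the latter case.

```python
import base64
import mimetypes

def build_image_payload(url=None, file_path=None):
    """Build the image portion of a Vision API request body.

    Exactly one of `url` or `file_path` must be given, mirroring the
    API rule that exactly one of image / image_base64 is provided.
    """
    if (url is None) == (file_path is None):
        raise ValueError("provide exactly one of url or file_path")
    if url is not None:
        return {"image": url}
    # Guess the MIME type from the extension for the data URI prefix.
    mime = mimetypes.guess_type(file_path)[0] or "application/octet-stream"
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"image_base64": f"data:{mime};base64,{encoded}"}
```

Failing fast on a malformed payload here saves a round trip; the server would reject a request containing both fields or neither.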

vision.describe

Generate a detailed natural-language description of an image.

Endpoint: POST /v1/vision/describe

Request:

```json
{
  "image": "https://example.com/photo.jpg",
  "prompt": "Describe the safety equipment visible",
  "language": "en",
  "detail": "high"
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | no | Custom description prompt |
| language | string | no | Output language (e.g. en, zh) |
| detail | "low" \| "high" | no | Detail level (default: auto) |

Response:

```json
{
  "description": "A sunlit outdoor café scene with several patrons seated under yellow parasols along a cobblestone street.",
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1200, "output_tokens": 150 }
}
```
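A describe call is a plain JSON POST. The sketch below constructs such a request with the standard library; the base URL and the Bearer auth scheme are assumptions for illustration, not confirmed details of this API.

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # assumed base URL; substitute your own

def build_describe_request(image_url, prompt=None, api_key="YOUR_KEY"):
    """Construct (but do not send) a POST request for /v1/vision/describe.

    The Bearer auth header here is an assumption; check your account
    settings for the real authentication scheme.
    """
    body = {"image": image_url}
    if prompt is not None:
        body["prompt"] = prompt
    return urllib.request.Request(
        API_BASE + "/v1/vision/describe",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

# Sending is one more line once the request is built:
# with urllib.request.urlopen(build_describe_request(...)) as resp:
#     result = json.load(resp)
```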

vision.ocr

Extract text from an image with block-level position data.

Endpoint: POST /v1/vision/ocr

Request:

```json
{
  "image": "https://example.com/receipt.jpg",
  "language": "auto"
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | string | no | OCR language hint (default: auto) |

Response:

```json
{
  "text": "ACME STORE\n123 Main St\nTotal: $42.99",
  "blocks": [
    { "text": "ACME STORE", "bbox": [50, 20, 300, 60], "confidence": 0.99 },
    { "text": "123 Main St", "bbox": [50, 65, 280, 95], "confidence": 0.97 },
    { "text": "Total: $42.99", "bbox": [50, 180, 260, 210], "confidence": 0.98 }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 800, "output_tokens": 200 }
}
```

Each bbox uses [x1, y1, x2, y2] in pixel coordinates relative to the source image.
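The block-level data supports simple client-side post-processing. As a sketch, the helpers below measure a box and rebuild the text reading order from the blocks, dropping low-confidence ones; the top-to-bottom sort suits simple layouts like the receipt above, while multi-column pages need more care.

```python
def bbox_size(bbox):
    """Width and height of an [x1, y1, x2, y2] pixel box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1

def blocks_to_text(blocks, min_confidence=0.0):
    """Join OCR blocks top-to-bottom, skipping low-confidence ones.

    Sorts by the top edge (y1), then the left edge (x1).
    """
    kept = [b for b in blocks if b["confidence"] >= min_confidence]
    kept.sort(key=lambda b: (b["bbox"][1], b["bbox"][0]))
    return "\n".join(b["text"] for b in kept)
```

Raising `min_confidence` is a cheap way to trim noisy detections from low-quality scans without re-running OCR.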

vision.detect

Detect and locate objects in images with bounding boxes and confidence scores.

Endpoint: POST /v1/vision/detect

Request:

```json
{
  "image": "https://example.com/site.jpg",
  "classes": ["person", "hard_hat", "vehicle"],
  "confidence_threshold": 0.5
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| classes | string[] | no | Object classes to detect |
| confidence_threshold | number | no | Minimum confidence 0–1 (default: 0.5) |

Response:

```json
{
  "objects": [
    {
      "class": "person",
      "confidence": 0.95,
      "bbox": [100, 200, 300, 500],
      "attributes": {}
    },
    {
      "class": "hard_hat",
      "confidence": 0.88,
      "bbox": [120, 180, 200, 230],
      "attributes": { "color": "yellow" }
    }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1500, "output_tokens": 300 }
}
```
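When you request broad detection with a low server-side threshold, the response can be narrowed locally. A minimal sketch of that pattern:

```python
def filter_detections(objects, classes=None, min_confidence=0.5):
    """Client-side filter over a vision.detect response's objects list.

    Keeps objects whose class is in `classes` (all classes if None)
    and whose confidence meets the threshold.
    """
    return [
        o for o in objects
        if (classes is None or o["class"] in classes)
        and o["confidence"] >= min_confidence
    ]
```

Filtering locally means one detect call can feed several downstream checks (e.g. "any person without a hard_hat?") with different thresholds, instead of re-querying per check.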