Vision API

The Vision API exposes three image analysis endpoints: description, OCR, and object detection. All endpoints accept either a URL or a base64 data URI and return structured JSON.

Common Input Formats

| Parameter | Type | Description |
| --- | --- | --- |
| image | string | URL to an image (HTTPS) |
| image_base64 | string | Base64-encoded image (data URI) |

Supported image formats: JPG, PNG, WEBP — max 20 MB. Exactly one of image or image_base64 must be provided.
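The exactly-one-of rule above can be enforced client-side before a request ever leaves your machine. The helper below is an illustrative sketch (not an official SDK): it builds either the `image` or the `image_base64` field, encoding a local file as a data URI in the latter case.

```python
import base64
import mimetypes

def build_image_payload(url=None, file_path=None):
    """Build the image portion of a Vision API request body.

    Exactly one of `url` or `file_path` must be given, mirroring the
    API rule that exactly one of image / image_base64 is provided.
    """
    if (url is None) == (file_path is None):
        raise ValueError("provide exactly one of url or file_path")
    if url is not None:
        return {"image": url}
    # Guess the MIME type from the extension for the data URI prefix.
    mime = mimetypes.guess_type(file_path)[0] or "application/octet-stream"
    with open(file_path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return {"image_base64": f"data:{mime};base64,{encoded}"}
```

Failing fast on a malformed payload here saves a round trip; the server would reject a request containing both fields or neither.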

vision.describe

Generate a detailed natural-language description of an image.

Endpoint: POST /v1/vision/describe

Request:

```json
{
  "image": "https://example.com/photo.jpg",
  "prompt": "Describe the safety equipment visible",
  "language": "en",
  "detail": "high"
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | no | Custom description prompt |
| language | string | no | Output language (e.g. en, zh) |
| detail | "low" \| "high" | no | Detail level (default: auto) |

Response:

```json
{
  "description": "A sunlit outdoor café scene with several patrons seated under yellow parasols along a cobblestone street.",
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1200, "output_tokens": 150 }
}
```
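A describe call is a plain JSON POST. The sketch below constructs such a request with the standard library; the base URL and the Bearer auth scheme are assumptions for illustration, not confirmed details of this API.

```python
import json
import urllib.request

API_BASE = "https://api.example.com"  # assumed base URL; substitute your own

def build_describe_request(image_url, prompt=None, api_key="YOUR_KEY"):
    """Construct (but do not send) a POST request for /v1/vision/describe.

    The Bearer auth header here is an assumption; check your account
    settings for the real authentication scheme.
    """
    body = {"image": image_url}
    if prompt is not None:
        body["prompt"] = prompt
    return urllib.request.Request(
        API_BASE + "/v1/vision/describe",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )

# Sending is one more line once the request is built:
# with urllib.request.urlopen(build_describe_request(...)) as resp:
#     result = json.load(resp)
```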

vision.ocr

Extract text from an image with block-level position data.

Endpoint: POST /v1/vision/ocr

Request:

```json
{
  "image": "https://example.com/receipt.jpg",
  "language": "auto"
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| language | string | no | OCR language hint (default: auto) |

Response:

```json
{
  "text": "ACME STORE\n123 Main St\nTotal: $42.99",
  "blocks": [
    { "text": "ACME STORE", "bbox": [50, 20, 300, 60], "confidence": 0.99 },
    { "text": "123 Main St", "bbox": [50, 65, 280, 95], "confidence": 0.97 },
    { "text": "Total: $42.99", "bbox": [50, 180, 260, 210], "confidence": 0.98 }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 800, "output_tokens": 200 }
}
```

Each bbox uses [x1, y1, x2, y2] in pixel coordinates relative to the source image.
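The block-level data supports simple client-side post-processing. As a sketch, the helpers below measure a box and rebuild the text reading order from the blocks, dropping low-confidence ones; the top-to-bottom sort suits simple layouts like the receipt above, while multi-column pages need more care.

```python
def bbox_size(bbox):
    """Width and height of an [x1, y1, x2, y2] pixel box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1

def blocks_to_text(blocks, min_confidence=0.0):
    """Join OCR blocks top-to-bottom, skipping low-confidence ones.

    Sorts by the top edge (y1), then the left edge (x1).
    """
    kept = [b for b in blocks if b["confidence"] >= min_confidence]
    kept.sort(key=lambda b: (b["bbox"][1], b["bbox"][0]))
    return "\n".join(b["text"] for b in kept)
```

Raising `min_confidence` is a cheap way to trim noisy detections from low-quality scans without re-running OCR.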

vision.detect

Detect and locate objects in images with bounding boxes and confidence scores.

Endpoint: POST /v1/vision/detect

Request:

```json
{
  "image": "https://example.com/site.jpg",
  "classes": ["person", "hard_hat", "vehicle"],
  "confidence_threshold": 0.5
}
```
| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| classes | string[] | no | Object classes to detect |
| confidence_threshold | number | no | Minimum confidence 0–1 (default: 0.5) |

Response:

```json
{
  "objects": [
    {
      "class": "person",
      "confidence": 0.95,
      "bbox": [100, 200, 300, 500],
      "attributes": {}
    },
    {
      "class": "hard_hat",
      "confidence": 0.88,
      "bbox": [120, 180, 200, 230],
      "attributes": { "color": "yellow" }
    }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1500, "output_tokens": 300 }
}
```
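When you request broad detection with a low server-side threshold, the response can be narrowed locally. A minimal sketch of that pattern:

```python
def filter_detections(objects, classes=None, min_confidence=0.5):
    """Client-side filter over a vision.detect response's objects list.

    Keeps objects whose class is in `classes` (all classes if None)
    and whose confidence meets the threshold.
    """
    return [
        o for o in objects
        if (classes is None or o["class"] in classes)
        and o["confidence"] >= min_confidence
    ]
```

Filtering locally means one detect call can feed several downstream checks (e.g. "any person without a hard_hat?") with different thresholds, instead of re-querying per check.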