vision.detect

Detects objects in an image and returns each object's class, confidence score, bounding box, and attributes.

Input

JPG, PNG, WEBP — max 20MB
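
The format and size limits above can be checked on the client before a file is submitted. The following is a minimal sketch, not part of the tool itself: the function name `check_upload` is hypothetical, "20MB" is assumed to mean 20 × 1024 × 1024 bytes, and `.jpeg` is assumed to be accepted alongside `.jpg`.

```python
from pathlib import Path

# Assumption: the listed formats map to these file extensions.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}
# Assumption: "20MB" is interpreted as 20 MiB.
MAX_BYTES = 20 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    suffix = Path(filename).suffix.lower()  # case-insensitive extension check
    if suffix not in ALLOWED_SUFFIXES:
        problems.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems
```

Checking only the extension does not prove the file contents are a valid image; the service may still reject a file that passes this check.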

Example Output

{
  "objects": [
    { "class": "person", "confidence": 0.95,
      "bbox": [100, 50, 300, 400], "attributes": {} }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1024, "output_tokens": 96 }
}
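
A response shaped like the example can be consumed as follows. This is a sketch under stated assumptions: the helper `confident_objects` and its default threshold are hypothetical, and `bbox` is assumed to be `[x1, y1, x2, y2]` in pixels, which the page does not confirm (it could equally be `[x, y, width, height]`).

```python
import json

# The example response from this page, embedded as a string.
RESPONSE = """
{
  "objects": [
    { "class": "person", "confidence": 0.95,
      "bbox": [100, 50, 300, 400], "attributes": {} }
  ],
  "model": "qwen-vl-max",
  "usage": { "input_tokens": 1024, "output_tokens": 96 }
}
"""


def confident_objects(response: dict, threshold: float = 0.5) -> list[dict]:
    """Keep detections at or above the threshold and add box dimensions."""
    results = []
    for obj in response.get("objects", []):
        if obj["confidence"] < threshold:
            continue
        # Assumption: bbox is [x1, y1, x2, y2]; adjust if it is [x, y, w, h].
        x1, y1, x2, y2 = obj["bbox"]
        results.append({
            "class": obj["class"],
            "confidence": obj["confidence"],
            "width": x2 - x1,
            "height": y2 - y1,
        })
    return results


data = json.loads(RESPONSE)
found = confident_objects(data)
total_tokens = data["usage"]["input_tokens"] + data["usage"]["output_tokens"]
```

Under the corner-coordinate assumption, the single detection in the example is 200 pixels wide and 350 pixels high, and the call used 1120 tokens in total.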