Video API
The Video API processes videos through a multi-step pipeline (detect, OCR, feature extraction, …) and streams results frame-by-frame as SSE chunks. Tasks are asynchronous: POST to enqueue, then poll status or subscribe to a stream. Long videos run in the background — your client never has to keep an HTTP request open for minutes.
Try it: open any project's Video Tasks tab to submit a clip and watch results stream in.
When to use it
- The clip is more than a few seconds, or you want progressive results before processing finishes.
- You need stable per-object track_ids across frames (object tracking, dwell-time analysis, re-identification setup).
- You want to chain multiple vision models on the same input (e.g. detect pedestrians, then OCR vehicle plates, then extract feature vectors).
For single images, use vision.describe / vision.ocr / vision.detect — they're synchronous and lower-latency.
Endpoints
| Method + Path | Purpose |
|---|---|
| POST /v1/video/process | Enqueue a task. Returns 202 { id, stream_url }. |
| GET /v1/video/tasks | List the project's tasks. Filters: status, from, to, cursor, limit. |
| GET /v1/video/tasks/:id | Fetch task status. Add ?include=request to echo the request body. |
| GET /v1/video/tasks/:id/stream | SSE re-broadcast — chunks for live frames + replay of past chunks. Honors Last-Event-ID byte offsets on reconnect. |
| GET /v1/video/tasks/:id/results | Final aggregated result (per-track summary). 404 until status="succeeded". |
| POST /v1/video/tasks/:id/cancel | Request cancellation. Worker aborts the upstream and finalizes within one chunk window. |
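
Polling GET /v1/video/tasks/:id until a terminal status is the simplest client loop. Below is a minimal sketch, not a definitive implementation: waitForTask and BASE are hypothetical names, and it assumes a Node 18+ runtime with global fetch, the API key in VISOWORK_API_KEY, and terminal statuses "succeeded" / "failed" / "cancelled" as used elsewhere on this page.

```ts
// Sketch: poll GET /v1/video/tasks/:id until a terminal status.
// Assumed terminal statuses: "succeeded" | "failed" | "cancelled".
const BASE = "https://api.visowork.com";

async function waitForTask(id: string): Promise<string> {
  while (true) {
    const res = await fetch(`${BASE}/v1/video/tasks/${id}`, {
      headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
    });
    if (!res.ok) throw new Error(`status fetch failed: ${res.status}`);
    const task = await res.json();
    if (["succeeded", "failed", "cancelled"].includes(task.status)) {
      return task.status;
    }
    await new Promise((r) => setTimeout(r, 2000)); // back off between polls
  }
}
```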
video.process — POST /v1/video/process
Request:
```json
{
"input": [
{
"type": "video_url",
"url": "https://example.com/clip.mp4",
"sampling": { "mode": "fps", "value": 5 },
"range": { "start": 0, "end": 30, "unit": "seconds" },
"max_frames": 300
}
],
"steps": [
{ "model": "detect", "parameters": { "classes": ["pedestrian"] } },
{ "model": "feature" }
],
"model": "qwen-vl-max"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| input[].type | "video_url" \| "video_base64" | yes | One element only in v1. Base64 inputs are staged to S3 before the worker fetches. |
| input[].url | string | yes (for video_url) | Public URL or S3 presigned GET. |
| input[].sampling | object | no | { mode: "fps", value: 5 }, { mode: "keyframe" }, or { mode: "interval", value: 3 }. Default: 1 FPS. |
| input[].range | object | no | { start?, end?, unit?: "seconds" \| "frames" }. Default: full video. |
| input[].max_frames | number | no | Stop after this count. Default: 3000. |
| steps[] | array | yes | Sequential pipeline steps; each names a model + optional parameters and when filters. |
| model | string | no | Provider routing hint. Echoed back in the first SSE chunk. |
Response: 202 Accepted
```json
{
"id": "vt_01HZ...",
"status": "pending",
"stream_url": "/v1/video/tasks/vt_01HZ.../stream"
}
```
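
The same call from TypeScript is a single POST. The helper below is a sketch: enqueueVideoTask is a hypothetical name, and only the route and the 202 body shape come from this page.

```ts
// Sketch: enqueue a task and return its id + stream URL.
// Assumes Node 18+ fetch and the API key in VISOWORK_API_KEY.
async function enqueueVideoTask(
  body: unknown,
): Promise<{ id: string; stream_url: string }> {
  const res = await fetch("https://api.visowork.com/v1/video/process", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VISOWORK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (res.status !== 202) throw new Error(`enqueue failed: ${res.status}`);
  return res.json(); // { id, status: "pending", stream_url }
}
```

Pipeline steps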
Each steps[] entry names a vision model and optionally filters which targets it operates on:
```json
{
"model": "ocr",
"when": { "class_is": ["automobile", "truck"], "min_confidence": 0.7 }
}
```

Targets that don't match when pass through unchanged to the next step. This lets you build conditional flows in a flat list — e.g. detect everything → OCR only on vehicles → embed all targets.
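
As a sketch, that detect → OCR → feature flow could be written as the following steps value; the class names and the 0.7 threshold are illustrative, not part of the API contract.

```ts
// Sketch: a flat conditional pipeline. Detect everything, OCR only
// vehicle-class targets, then embed every target.
const steps = [
  { model: "detect" },
  { model: "ocr", when: { class_is: ["automobile", "truck"], min_confidence: 0.7 } },
  { model: "feature" },
];
```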
Streaming results
GET /v1/video/tasks/:id/stream is Content-Type: text/event-stream. Each event is one frame:
data: {"id":"mm-pipe-abc","object":"mm.pipeline.chunk",
"stream_info":{"frame_index":42,"total_frames":150,
"decoded_frames":43,"dropped_frames":0,
"fps":25.0,"elapsed_ms":1720},
"delta":{"results":[{"index":0,"targets":[
{"track_id":1,"class":"pedestrian","confidence":0.91,
"roi":{"left":55,"top":98,"width":60,"height":180}}]}]},
"finish_reason":null}
The terminal chunk carries a non-null finish_reason ("stop", "max_frames", "error", or "cancelled") and a final usage block, followed by:
```
data: [DONE]
```
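
One way to consume the stream is to buffer bytes, split on the blank line between events, and JSON-parse each data: payload. The sketch below assumes Node 18+ fetch (where the response body is async-iterable) and the chunk shape shown above; it is illustrative, not a hardened parser.

```ts
// Sketch: read the SSE stream, parse each frame chunk, stop on [DONE].
// Assumes events framed as "data: <json>\n\n".
async function consumeStream(streamUrl: string): Promise<void> {
  const res = await fetch(`https://api.visowork.com${streamUrl}`, {
    headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
  });
  const body = res.body as unknown as AsyncIterable<Uint8Array>; // async-iterable in Node
  const decoder = new TextDecoder();
  let buffer = "";
  for await (const bytes of body) {
    buffer += decoder.decode(bytes, { stream: true });
    let sep: number;
    while ((sep = buffer.indexOf("\n\n")) !== -1) {
      const payload = buffer.slice(0, sep).replace(/^data: /, "");
      buffer = buffer.slice(sep + 2);
      if (payload === "[DONE]") return; // end of stream
      const chunk = JSON.parse(payload);
      if (chunk.finish_reason !== null) {
        console.log("terminal chunk:", chunk.finish_reason, chunk.usage);
      } else {
        const targets = chunk.delta.results[0]?.targets ?? [];
        console.log(`frame ${chunk.stream_info.frame_index}: ${targets.length} targets`);
      }
    }
  }
}
```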
stream_info fields
| Field | Type | Description |
|---|---|---|
| frame_index | int | Current frame index (monotonically non-decreasing). |
| total_frames | int \| null | Total frames in the video; null for live streams. |
| decoded_frames | int | Cumulative count of successfully decoded frames. |
| dropped_frames | int | Cumulative count of decode failures + sampling skips. |
| fps | float | Source video frame rate. |
| elapsed_ms | int | Wall-clock time since stream start. |
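
These counters are enough to drive a progress indicator, with the caveat that total_frames can be null. A small sketch (describeProgress is a hypothetical helper):

```ts
// Sketch: derive a progress label and drop rate from stream_info.
// total_frames is null for live streams, so progress may be unknown.
function describeProgress(info: {
  frame_index: number;
  total_frames: number | null;
  decoded_frames: number;
  dropped_frames: number;
}): string {
  const pct = info.total_frames === null
    ? "live"
    : `${Math.round((100 * (info.frame_index + 1)) / info.total_frames)}%`;
  const seen = info.decoded_frames + info.dropped_frames;
  const dropRate = seen === 0 ? 0 : info.dropped_frames / seen;
  return `${pct}, drop rate ${(100 * dropRate).toFixed(1)}%`;
}
```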
Reconnection
On reconnect, send Last-Event-ID: <byteOffset> (or ?offset=<n>) — the server replays from that byte and then tails live. The number is the cumulative byte count of all previously-emitted SSE bytes; you maintain it client-side as you read.
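Because the offset is a byte count rather than an event id, the client has to tally raw bytes as it reads. A hedged sketch of that bookkeeping (streamWithResume is a hypothetical helper; Node 18+ fetch assumed):

```ts
// Sketch: track the cumulative SSE byte offset and resume with Last-Event-ID.
// The server replays from that byte, so nothing is lost across reconnects.
async function streamWithResume(streamUrl: string): Promise<void> {
  let offset = 0;
  while (true) {
    const res = await fetch(`https://api.visowork.com${streamUrl}`, {
      headers: {
        Authorization: `Bearer ${process.env.VISOWORK_API_KEY}`,
        ...(offset > 0 ? { "Last-Event-ID": String(offset) } : {}),
      },
    });
    const body = res.body as unknown as AsyncIterable<Uint8Array>;
    try {
      for await (const bytes of body) {
        offset += bytes.byteLength; // count every emitted SSE byte
        // ...feed bytes into the event parser from the previous sketch...
      }
      return; // stream ended cleanly (after [DONE])
    } catch {
      // connection dropped; loop and resume from `offset`
    }
  }
}
```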
Track aggregation
For video inputs, every target carries a stable track_id — the same physical object keeps the same id across frames. This lets you compute trajectories, dwell-time, and (with feature steps) cross-camera handoff.
GET /v1/video/tasks/:id/results returns the rolled-up per-track summary once the task succeeds:
```json
{
"id": "vt_01HZ...",
"object": "mm.pipeline.result",
"model": "qwen-vl-max",
"usage": { "processed_frames": 150, "processing_time_ms": 6200 },
"tracks": [
{
"track_id": 1,
"class": "pedestrian",
"first_frame": 0,
"last_frame": 149,
"frames_seen": 30,
"best_confidence": 0.93,
"feature_digest": "dim=512;norm=14.2031"
}
]
}
```
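
Dwell time per track then falls out of the frame span and the source frame rate reported in stream_info (25.0 fps in the earlier chunk example). A sketch, with hypothetical type and function names:

```ts
// Sketch: approximate on-screen time per track from the aggregated result.
// frames_seen can be smaller than the span when sampling skips frames,
// so the frame span (not frames_seen) is the basis for dwell time.
interface TrackSummary {
  track_id: number;
  class: string;
  first_frame: number;
  last_frame: number;
  frames_seen: number;
  best_confidence: number;
}

function dwellSeconds(track: TrackSummary, sourceFps: number): number {
  return (track.last_frame - track.first_frame + 1) / sourceFps;
}

// Track 1 above: (149 - 0 + 1) / 25.0 = 6.0 seconds on screen.
```

Quick start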
```bash
# 1. Create a task
curl -X POST https://api.visowork.com/v1/video/process \
-H "Authorization: Bearer $VISOWORK_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": [{ "type": "video_url", "url": "https://example.com/clip.mp4",
"sampling": { "mode": "fps", "value": 5 }, "max_frames": 300 }],
"steps": [
{ "model": "detect", "parameters": { "classes": ["pedestrian"] } },
{ "model": "feature" }
]
}'
# → { "id": "vt_01HZ...", "status": "pending", "stream_url": "/v1/video/tasks/vt_01HZ.../stream" }
# 2. Stream chunks live
curl -N https://api.visowork.com/v1/video/tasks/vt_01HZ.../stream \
-H "Authorization: Bearer $VISOWORK_API_KEY"
# 3. Read the final aggregate when done
curl https://api.visowork.com/v1/video/tasks/vt_01HZ.../results \
-H "Authorization: Bearer $VISOWORK_API_KEY"Failover and retries
- Pre-stream errors (4xx/5xx before the first chunk) trigger failover to the next-priority provider, up to 2 retries.
- Mid-stream errors are terminal — the task is marked failed and the partial archive is retained. Resume-from-offset is on the roadmap.
Cancellation
POST /v1/video/tasks/:id/cancel flips the task to cancelling. The worker observes the next chunk boundary, aborts the upstream fetch, and finalizes status cancelled. Already-archived chunks remain accessible via the SSE replay endpoint.
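
Client-side, that is one POST followed by a status watch. A sketch (cancelTask is a hypothetical helper, reusing the waitForTask polling loop from the Endpoints section):

```ts
// Sketch: request cancellation, then wait for the terminal "cancelled" status.
async function cancelTask(id: string): Promise<void> {
  const res = await fetch(
    `https://api.visowork.com/v1/video/tasks/${id}/cancel`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
    },
  );
  if (!res.ok) throw new Error(`cancel failed: ${res.status}`);
  // The task flips to "cancelling"; the worker finalizes at the next chunk boundary.
  const status = await waitForTask(id); // polling sketch from earlier
  console.log(status); // expect "cancelled"
}
```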