Video API
The Video API processes videos through a multi-step pipeline (detect, OCR, feature extraction, …) and streams results frame-by-frame as SSE chunks. Tasks are asynchronous: POST to enqueue, then poll status or subscribe to a stream. Long videos run in the background — your client never has to keep an HTTP request open for minutes.
Try it: open any project's Video Tasks tab to submit a clip and watch results stream in.
When to use it
- The clip is more than a few seconds, or you want progressive results before processing finishes.
- You need stable per-object track_ids across frames (object tracking, dwell-time analysis, re-identification setup).
- You want to chain multiple vision models on the same input (e.g. detect pedestrians, then OCR vehicle plates, then extract feature vectors).
For single images, use vision.describe / vision.ocr / vision.detect — they're synchronous and lower-latency.
Endpoints
| Method + Path | Purpose |
|---|---|
| POST /v1/video/process | Enqueue a task. Returns 202 { id, stream_url }. |
| GET /v1/video/tasks | List the project's tasks. Filters: status, from, to, cursor, limit. |
| GET /v1/video/tasks/:id | Fetch task status. Add ?include=request to echo the request body. |
| GET /v1/video/tasks/:id/stream | SSE re-broadcast — chunks for live frames + replay of past chunks. Honors Last-Event-ID byte offsets on reconnect. |
| GET /v1/video/tasks/:id/results | Final aggregated result (per-track summary). 404 until status="succeeded". |
| POST /v1/video/tasks/:id/cancel | Request cancellation. Worker aborts the upstream and finalizes within one chunk window. |
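
Polling GET /v1/video/tasks/:id until a terminal status is the simplest client loop. Below is a minimal sketch, not a definitive implementation: waitForTask and BASE are hypothetical names, and it assumes a Node 18+ runtime with global fetch, the API key in VISOWORK_API_KEY, and terminal statuses "succeeded" / "failed" / "cancelled" as used elsewhere on this page.

```ts
// Sketch: poll GET /v1/video/tasks/:id until a terminal status.
// Assumed terminal statuses: "succeeded" | "failed" | "cancelled".
const BASE = "https://api.visowork.com";

async function waitForTask(id: string): Promise<string> {
  while (true) {
    const res = await fetch(`${BASE}/v1/video/tasks/${id}`, {
      headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
    });
    if (!res.ok) throw new Error(`status fetch failed: ${res.status}`);
    const task = await res.json();
    if (["succeeded", "failed", "cancelled"].includes(task.status)) {
      return task.status;
    }
    await new Promise((r) => setTimeout(r, 2000)); // back off between polls
  }
}
```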
video.process — POST /v1/video/process
Request:
```json
{
"input": [
{
"type": "video_url",
"url": "https://example.com/clip.mp4",
"sampling": { "mode": "fps", "value": 5 },
"range": { "start": 0, "end": 30, "unit": "seconds" },
"max_frames": 300
}
],
"steps": [
{ "model": "detect", "parameters": { "classes": ["pedestrian"] } },
{ "model": "feature" }
],
"model": "qwen-vl-max"
}
```

| Field | Type | Required | Description |
|---|---|---|---|
| input[].type | "video_url" \| "video_base64" | yes | One element only in v1. Base64 inputs are staged to S3 before the worker fetches. |
| input[].url | string | yes (for video_url) | Public URL or S3 presigned GET. |
| input[].sampling | object | no | { mode: "fps", value: 5 }, { mode: "keyframe" }, or { mode: "interval", value: 3 }. Default: 1 FPS. |
| input[].range | object | no | { start?, end?, unit?: "seconds" \| "frames" }. Default: full video. |
| input[].max_frames | number | no | Stop after this count. Default: 3000. |
| steps[] | array | yes | Sequential pipeline steps; each names a model + optional parameters and when filters. |
| model | string | no | Provider routing hint. Echoed back in the first SSE chunk. |
Response: 202 Accepted
```json
{
"id": "vt_01HZ...",
"status": "pending",
"stream_url": "/v1/video/tasks/vt_01HZ.../stream"
}
```
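
The same call from TypeScript is a single POST. The helper below is a sketch: enqueueVideoTask is a hypothetical name, and only the route and the 202 body shape come from this page.

```ts
// Sketch: enqueue a task and return its id + stream URL.
// Assumes Node 18+ fetch and the API key in VISOWORK_API_KEY.
async function enqueueVideoTask(
  body: unknown,
): Promise<{ id: string; stream_url: string }> {
  const res = await fetch("https://api.visowork.com/v1/video/process", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.VISOWORK_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });
  if (res.status !== 202) throw new Error(`enqueue failed: ${res.status}`);
  return res.json(); // { id, status: "pending", stream_url }
}
```

Pipeline steps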
Each steps[] entry names a vision model and optionally filters which targets it operates on:
```json
{
"model": "ocr",
"when": { "class_is": ["automobile", "truck"], "min_confidence": 0.7 }
}
```

Targets that don't match when pass through unchanged to the next step. This lets you build conditional flows in a flat list — e.g. detect everything → OCR only on vehicles → embed all targets.
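
As a sketch, that detect → OCR → feature flow could be written as the following steps value; the class names and the 0.7 threshold are illustrative, not part of the API contract.

```ts
// Sketch: a flat conditional pipeline. Detect everything, OCR only
// vehicle-class targets, then embed every target.
const steps = [
  { model: "detect" },
  { model: "ocr", when: { class_is: ["automobile", "truck"], min_confidence: 0.7 } },
  { model: "feature" },
];
```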
Streaming results
GET /v1/video/tasks/:id/stream is Content-Type: text/event-stream. Each event is one frame:
data: {"id":"mm-pipe-abc","object":"mm.pipeline.chunk",
"stream_info":{"frame_index":42,"total_frames":150,
"decoded_frames":43,"dropped_frames":0,
"fps":25.0,"elapsed_ms":1720},
"delta":{"results":[{"index":0,"targets":[
{"track_id":1,"class":"pedestrian","confidence":0.91,
"roi":{"left":55,"top":98,"width":60,"height":180}}]}]},
"finish_reason":null}
The terminal chunk carries a non-null finish_reason ("stop", "max_frames", "error", or "cancelled") and a final usage block, followed by:
```
data: [DONE]
```
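
One way to consume the stream is to buffer bytes, split on the blank line between events, and JSON-parse each data: payload. The sketch below assumes Node 18+ fetch (where the response body is async-iterable) and the chunk shape shown above; it is illustrative, not a hardened parser.

```ts
// Sketch: read the SSE stream, parse each frame chunk, stop on [DONE].
// Assumes events framed as "data: <json>\n\n".
async function consumeStream(streamUrl: string): Promise<void> {
  const res = await fetch(`https://api.visowork.com${streamUrl}`, {
    headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
  });
  const body = res.body as unknown as AsyncIterable<Uint8Array>; // async-iterable in Node
  const decoder = new TextDecoder();
  let buffer = "";
  for await (const bytes of body) {
    buffer += decoder.decode(bytes, { stream: true });
    let sep: number;
    while ((sep = buffer.indexOf("\n\n")) !== -1) {
      const payload = buffer.slice(0, sep).replace(/^data: /, "");
      buffer = buffer.slice(sep + 2);
      if (payload === "[DONE]") return; // end of stream
      const chunk = JSON.parse(payload);
      if (chunk.finish_reason !== null) {
        console.log("terminal chunk:", chunk.finish_reason, chunk.usage);
      } else {
        const targets = chunk.delta.results[0]?.targets ?? [];
        console.log(`frame ${chunk.stream_info.frame_index}: ${targets.length} targets`);
      }
    }
  }
}
```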
stream_info fields
| Field | Type | Description |
|---|---|---|
| frame_index | int | Current frame index (monotonically non-decreasing). |
| total_frames | int \| null | Total frames in the video; null for live streams. |
| decoded_frames | int | Cumulative count of successfully decoded frames. |
| dropped_frames | int | Cumulative count of decode failures + sampling skips. |
| fps | float | Source video frame rate. |
| elapsed_ms | int | Wall-clock time since stream start. |
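
These counters are enough to drive a progress indicator, with the caveat that total_frames can be null. A small sketch (describeProgress is a hypothetical helper):

```ts
// Sketch: derive a progress label and drop rate from stream_info.
// total_frames is null for live streams, so progress may be unknown.
function describeProgress(info: {
  frame_index: number;
  total_frames: number | null;
  decoded_frames: number;
  dropped_frames: number;
}): string {
  const pct = info.total_frames === null
    ? "live"
    : `${Math.round((100 * (info.frame_index + 1)) / info.total_frames)}%`;
  const seen = info.decoded_frames + info.dropped_frames;
  const dropRate = seen === 0 ? 0 : info.dropped_frames / seen;
  return `${pct}, drop rate ${(100 * dropRate).toFixed(1)}%`;
}
```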
Reconnection
On reconnect, send Last-Event-ID: <byteOffset> (or ?offset=<n>) — the server replays from that byte and then tails live. The number is the cumulative byte count of all previously-emitted SSE bytes; you maintain it client-side as you read.
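Because the offset is a byte count rather than an event id, the client has to tally raw bytes as it reads. A hedged sketch of that bookkeeping (streamWithResume is a hypothetical helper; Node 18+ fetch assumed):

```ts
// Sketch: track the cumulative SSE byte offset and resume with Last-Event-ID.
// The server replays from that byte, so nothing is lost across reconnects.
async function streamWithResume(streamUrl: string): Promise<void> {
  let offset = 0;
  while (true) {
    const res = await fetch(`https://api.visowork.com${streamUrl}`, {
      headers: {
        Authorization: `Bearer ${process.env.VISOWORK_API_KEY}`,
        ...(offset > 0 ? { "Last-Event-ID": String(offset) } : {}),
      },
    });
    const body = res.body as unknown as AsyncIterable<Uint8Array>;
    try {
      for await (const bytes of body) {
        offset += bytes.byteLength; // count every emitted SSE byte
        // ...feed bytes into the event parser from the previous sketch...
      }
      return; // stream ended cleanly (after [DONE])
    } catch {
      // connection dropped; loop and resume from `offset`
    }
  }
}
```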
Track aggregation
For video inputs, every target carries a stable track_id — the same physical object keeps the same id across frames. This lets you compute trajectories, dwell-time, and (with feature steps) cross-camera handoff.
GET /v1/video/tasks/:id/results returns the rolled-up per-track summary once the task succeeds:
```json
{
"id": "vt_01HZ...",
"object": "mm.pipeline.result",
"model": "qwen-vl-max",
"usage": { "processed_frames": 150, "processing_time_ms": 6200 },
"tracks": [
{
"track_id": 1,
"class": "pedestrian",
"first_frame": 0,
"last_frame": 149,
"frames_seen": 30,
"best_confidence": 0.93,
"feature_digest": "dim=512;norm=14.2031"
}
]
}
```
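
Dwell time per track then falls out of the frame span and the source frame rate reported in stream_info (25.0 fps in the earlier chunk example). A sketch, with hypothetical type and function names:

```ts
// Sketch: approximate on-screen time per track from the aggregated result.
// frames_seen can be smaller than the span when sampling skips frames,
// so the frame span (not frames_seen) is the basis for dwell time.
interface TrackSummary {
  track_id: number;
  class: string;
  first_frame: number;
  last_frame: number;
  frames_seen: number;
  best_confidence: number;
}

function dwellSeconds(track: TrackSummary, sourceFps: number): number {
  return (track.last_frame - track.first_frame + 1) / sourceFps;
}

// Track 1 above: (149 - 0 + 1) / 25.0 = 6.0 seconds on screen.
```

Quick start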
```bash
# 1. Create a task
curl -X POST https://api.visowork.com/v1/video/process \
-H "Authorization: Bearer $VISOWORK_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"input": [{ "type": "video_url", "url": "https://example.com/clip.mp4",
"sampling": { "mode": "fps", "value": 5 }, "max_frames": 300 }],
"steps": [
{ "model": "detect", "parameters": { "classes": ["pedestrian"] } },
{ "model": "feature" }
]
}'
# → { "id": "vt_01HZ...", "status": "pending", "stream_url": "/v1/video/tasks/vt_01HZ.../stream" }
# 2. Stream chunks live
curl -N https://api.visowork.com/v1/video/tasks/vt_01HZ.../stream \
-H "Authorization: Bearer $VISOWORK_API_KEY"
# 3. Read the final aggregate when done
curl https://api.visowork.com/v1/video/tasks/vt_01HZ.../results \
-H "Authorization: Bearer $VISOWORK_API_KEY"Failover and retries
- Pre-stream errors (4xx/5xx before the first chunk) trigger failover to the next-priority provider, up to 2 retries.
- Mid-stream errors are terminal — the task is marked failed and the partial archive is retained. Resume-from-offset is on the roadmap.
Cancellation
POST /v1/video/tasks/:id/cancel flips the task to cancelling. The worker observes the next chunk boundary, aborts the upstream fetch, and finalizes status cancelled. Already-archived chunks remain accessible via the SSE replay endpoint.
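
Client-side, that is one POST followed by a status watch. A sketch (cancelTask is a hypothetical helper, reusing the waitForTask polling loop from the Endpoints section):

```ts
// Sketch: request cancellation, then wait for the terminal "cancelled" status.
async function cancelTask(id: string): Promise<void> {
  const res = await fetch(
    `https://api.visowork.com/v1/video/tasks/${id}/cancel`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${process.env.VISOWORK_API_KEY}` },
    },
  );
  if (!res.ok) throw new Error(`cancel failed: ${res.status}`);
  // The task flips to "cancelling"; the worker finalizes at the next chunk boundary.
  const status = await waitForTask(id); // polling sketch from earlier
  console.log(status); // expect "cancelled"
}
```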