Video API

The Video API processes videos through a multi-step pipeline (detect, OCR, feature extraction, …) and streams results frame-by-frame as SSE chunks. Tasks are asynchronous: POST to enqueue, then poll status or subscribe to a stream. Long videos run in the background — your client never has to keep an HTTP request open for minutes.

Try it: open any project's Video Tasks tab to submit a clip and watch results stream in.

When to use it

  • The clip is more than a few seconds, or you want progressive results before processing finishes.
  • You need stable per-object track_ids across frames (object tracking, dwell-time analysis, re-identification setup).
  • You want to chain multiple vision models on the same input (e.g. detect pedestrians, then OCR vehicle plates, then extract feature vectors).

For single images, use vision.describe / vision.ocr / vision.detect — they're synchronous and lower-latency.

Endpoints

| Method & Path | Purpose |
| --- | --- |
| `POST /v1/video/process` | Enqueue a task. Returns `202` with `{ id, stream_url }`. |
| `GET /v1/video/tasks` | List the project's tasks. Filters: `status`, `from`, `to`, `cursor`, `limit`. |
| `GET /v1/video/tasks/:id` | Fetch task status. Add `?include=request` to echo the request body. |
| `GET /v1/video/tasks/:id/stream` | SSE re-broadcast: chunks for live frames plus replay of past chunks. Honors `Last-Event-ID` byte offsets on reconnect. |
| `GET /v1/video/tasks/:id/results` | Final aggregated result (per-track summary). `404` until `status="succeeded"`. |
| `POST /v1/video/tasks/:id/cancel` | Request cancellation. The worker aborts the upstream and finalizes within one chunk window. |
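
For the list endpoint, a small client-side iterator keeps the cursor-paging logic in one place. This sketch injects the HTTP call as a callable so it stays self-contained; the response keys `tasks` and `next_cursor` are assumptions, since the table above only names the filter parameters:

```python
def iter_tasks(fetch_page, status=None):
    """Walk GET /v1/video/tasks pages until the cursor runs out.

    `fetch_page(params) -> dict` performs the authenticated request.
    The response keys `tasks` and `next_cursor` are assumed here,
    not confirmed by the endpoint table.
    """
    cursor = None
    while True:
        params = {"limit": 50}
        if status:
            params["status"] = status
        if cursor:
            params["cursor"] = cursor
        page = fetch_page(params)
        yield from page.get("tasks", [])
        cursor = page.get("next_cursor")
        if not cursor:
            return
```

Injecting `fetch_page` also makes the pagination loop trivial to unit-test against canned pages.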

video.process — POST /v1/video/process

Request:

```json
{
  "input": [
    {
      "type": "video_url",
      "url": "https://example.com/clip.mp4",
      "sampling": { "mode": "fps", "value": 5 },
      "range": { "start": 0, "end": 30, "unit": "seconds" },
      "max_frames": 300
    }
  ],
  "steps": [
    { "model": "detect", "parameters": { "classes": ["pedestrian"] } },
    { "model": "feature" }
  ],
  "model": "qwen-vl-max"
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `input[].type` | `"video_url" \| "video_base64"` | yes | One element only in v1. Base64 inputs are staged to S3 before the worker fetches them. |
| `input[].url` | string | yes (for `video_url`) | Public URL or S3 presigned GET. |
| `input[].sampling` | object | no | `{ "mode": "fps", "value": 5 }`, `{ "mode": "keyframe" }`, or `{ "mode": "interval", "value": 3 }`. Default: 1 FPS. |
| `input[].range` | object | no | `{ start?, end?, unit?: "seconds" \| "frames" }`. Default: full video. |
| `input[].max_frames` | number | no | Stop after this count. Default: 3000. |
| `steps[]` | array | yes | Sequential pipeline steps; each names a model plus optional `parameters` and `when` filters. |
| `model` | string | no | Provider routing hint. Echoed back in the first SSE chunk. |

Response: 202 Accepted

```json
{
  "id": "vt_01HZ...",
  "status": "pending",
  "stream_url": "/v1/video/tasks/vt_01HZ.../stream"
}
```
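
Streaming is the primary interface, but for short clips a polling loop against `GET /v1/video/tasks/:id` is often enough. A minimal sketch, with the HTTP call and the sleep injected so the loop is easy to test; the terminal statuses are the ones this page names (`succeeded`, `failed`, `cancelled`):

```python
TERMINAL = {"succeeded", "failed", "cancelled"}

def poll_task(get_status, sleep, interval=2.0, max_polls=300):
    """Poll until the task reaches a terminal status.

    `get_status() -> dict` fetches GET /v1/video/tasks/:id;
    `sleep(seconds)` is injected so tests can skip real waiting.
    """
    for _ in range(max_polls):
        task = get_status()
        if task["status"] in TERMINAL:
            return task
        sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")
```
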

Pipeline steps

Each steps[] entry names a vision model and optionally filters which targets it operates on:

```json
{
  "model": "ocr",
  "when": { "class_is": ["automobile", "truck"], "min_confidence": 0.7 }
}
```

Targets that don't match when pass through unchanged to the next step. This lets you build conditional flows in a flat list — e.g. detect everything → OCR only on vehicles → embed all targets.
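
That detect-everything, OCR-only-vehicles, embed-everything flow can be spelled out as a flat `steps` array. A small helper (our own, not part of the API) keeps the optional fields tidy:

```python
def step(model, parameters=None, when=None):
    """Build one steps[] entry, omitting unset optional fields."""
    entry = {"model": model}
    if parameters:
        entry["parameters"] = parameters
    if when:
        entry["when"] = when
    return entry

steps = [
    step("detect"),  # detect every target in the frame
    step("ocr", when={"class_is": ["automobile", "truck"],
                      "min_confidence": 0.7}),  # OCR only on vehicles
    step("feature"),  # embed all targets, matched by `when` or not
]
```
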

Streaming results

GET /v1/video/tasks/:id/stream is Content-Type: text/event-stream. Each event is one frame:

```text
data: {"id":"mm-pipe-abc","object":"mm.pipeline.chunk",
       "stream_info":{"frame_index":42,"total_frames":150,
                      "decoded_frames":43,"dropped_frames":0,
                      "fps":25.0,"elapsed_ms":1720},
       "delta":{"results":[{"index":0,"targets":[
         {"track_id":1,"class":"pedestrian","confidence":0.91,
          "roi":{"left":55,"top":98,"width":60,"height":180}}]}]},
       "finish_reason":null}
```

The terminal chunk carries a non-null finish_reason ("stop", "max_frames", "error", or "cancelled") and a final usage block, followed by:

```text
data: [DONE]
```
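
A minimal reader for this stream splits out `data:` lines, parses each JSON payload, and stops at the `[DONE]` sentinel. The sketch assumes each event arrives as a single `data:` line; the chunk above is wrapped across lines only for display:

```python
import json

def read_chunks(lines):
    """Yield parsed chunk objects from an iterator of SSE lines.

    Assumes each event is a single `data:` line; stops (without
    yielding) when the [DONE] sentinel arrives.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)
```
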

stream_info fields

| Field | Type | Description |
| --- | --- | --- |
| `frame_index` | int | Current frame index (monotonically non-decreasing). |
| `total_frames` | int \| null | Total frames in the video; `null` for live streams. |
| `decoded_frames` | int | Cumulative count of successfully decoded frames. |
| `dropped_frames` | int | Cumulative count of decode failures plus sampling skips. |
| `fps` | float | Source video frame rate. |
| `elapsed_ms` | int | Wall-clock time since stream start. |

Reconnection

On reconnect, send Last-Event-ID: <byteOffset> (or ?offset=<n>) — the server replays from that byte and then tails live. The number is the cumulative byte count of all previously-emitted SSE bytes; you maintain it client-side as you read.
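
Because the offset is a byte count rather than an event id, the client has to accumulate it as bytes arrive. A minimal sketch of that bookkeeping:

```python
class OffsetTracker:
    """Counts every SSE byte read so far, for Last-Event-ID on reconnect."""

    def __init__(self, offset=0):
        self.offset = offset

    def feed(self, raw_bytes):
        """Record a raw chunk as read and pass it through unchanged."""
        self.offset += len(raw_bytes)
        return raw_bytes

    def reconnect_headers(self):
        """Headers to send when re-opening the stream."""
        return {"Last-Event-ID": str(self.offset)}
```

Run every raw chunk through `feed()` before parsing it, and use `reconnect_headers()` when the connection drops.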

Track aggregation

For video inputs, every target carries a stable track_id — the same physical object keeps the same id across frames. This lets you compute trajectories, dwell-time, and (with feature steps) cross-camera handoff.

GET /v1/video/tasks/:id/results returns the rolled-up per-track summary once the task succeeds:

```json
{
  "id": "vt_01HZ...",
  "object": "mm.pipeline.result",
  "model": "qwen-vl-max",
  "usage": { "processed_frames": 150, "processing_time_ms": 6200 },
  "tracks": [
    {
      "track_id": 1,
      "class": "pedestrian",
      "first_frame": 0,
      "last_frame": 149,
      "frames_seen": 30,
      "best_confidence": 0.93,
      "feature_digest": "dim=512;norm=14.2031"
    }
  ]
}
```
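
Dwell time falls out of this aggregate: the span between first and last sighting divided by a frame rate. In the examples above the frame indices appear to count sampled frames (150 frames at 5 FPS covering the 30-second clip), so this sketch divides by the sampling rate; treat that as an inference from the examples rather than a stated contract:

```python
def dwell_seconds(track, sampled_fps):
    """Approximate how long a track was visible, in seconds.

    Assumes first_frame/last_frame index *sampled* frames (as the
    150-frame, 5 FPS, 30 s example suggests); if they turn out to be
    source-frame indices, divide by stream_info.fps instead.
    """
    return (track["last_frame"] - track["first_frame"] + 1) / sampled_fps
```
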

Quick start

```bash
# 1. Create a task
curl -X POST https://api.visowork.com/v1/video/process \
  -H "Authorization: Bearer $VISOWORK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "input": [{ "type": "video_url", "url": "https://example.com/clip.mp4",
                "sampling": { "mode": "fps", "value": 5 }, "max_frames": 300 }],
    "steps": [
      { "model": "detect", "parameters": { "classes": ["pedestrian"] } },
      { "model": "feature" }
    ]
  }'
# → { "id": "vt_01HZ...", "status": "pending", "stream_url": "/v1/video/tasks/vt_01HZ.../stream" }

# 2. Stream chunks live
curl -N https://api.visowork.com/v1/video/tasks/vt_01HZ.../stream \
  -H "Authorization: Bearer $VISOWORK_API_KEY"

# 3. Read the final aggregate when done
curl https://api.visowork.com/v1/video/tasks/vt_01HZ.../results \
  -H "Authorization: Bearer $VISOWORK_API_KEY"
```

Failover and retries

  • Pre-stream errors (4xx/5xx before the first chunk) trigger failover to the next-priority provider, up to 2 retries.
  • Mid-stream errors are terminal — the task is marked failed and the partial archive is retained. Resume-from-offset is on the roadmap.

Cancellation

POST /v1/video/tasks/:id/cancel flips the task to cancelling. The worker observes the next chunk boundary, aborts the upstream fetch, and finalizes the task with status cancelled. Already-archived chunks remain accessible via the SSE replay endpoint.