Typical Usage#
This section describes a typical use of VSS.
Deployment#
Deployment starts with setting up the environment and then deploying VSS using either the Helm Chart or the Docker Compose method.
VSS provides many options to customize the deployment to your needs, such as toggling features like audio, the CV pipeline, and Guardrails on and off, switching between the available VLM and LLM models, and configuring the deployment topology based on the available system resources.
For more details, refer to:
File and Live Stream Ingestion and Summarization and Alerts#
After VSS is deployed, you can start summarizing files and live streams. This is the first operation that must be performed for any source, as it enables ingestion of the source into VSS.
This can be done using the Gradio UI, the reference Python CLI, or the REST API programmatically.
Add a file or live stream to the VSS using:
For the Gradio UI, select a preloaded example, upload a file, or enter a live stream URL
For the reference CLI, use the Add File or Add Live Stream commands
For the REST API, use the /files [POST] or /live-stream [POST] endpoints
Supported formats include:
H.264 / H.265 video codecs
OPUS / AAC audio codecs
mp4 / mkv container formats
Some formats may require installing additional codecs.
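As a rough illustration, the sketch below adds a local file and a live stream through the REST API using Python's requests library. The base URL, multipart field names, form values, and the response field id are assumptions for illustration; consult the API schema for the exact request and response shapes.

```python
import requests

VSS_BASE_URL = "http://localhost:8100"  # assumed host/port for the VSS REST API

# Upload a local video file; the multipart field name and form values are assumptions.
with open("warehouse.mp4", "rb") as f:
    resp = requests.post(
        f"{VSS_BASE_URL}/files",
        files={"file": ("warehouse.mp4", f)},
        data={"purpose": "vision", "media_type": "video"},
    )
resp.raise_for_status()
file_id = resp.json()["id"]  # assumed response field

# Add a live RTSP stream; the body fields are assumptions.
resp = requests.post(
    f"{VSS_BASE_URL}/live-stream",
    json={"liveStreamUrl": "rtsp://camera.local/stream1", "description": "dock camera"},
)
resp.raise_for_status()
stream_id = resp.json()["id"]  # assumed response field
```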
Summarize the source. The summarization API is the most important API because it configures how the source is ingested, and it exposes several parameters that can be tuned.
When the summarize API is called, VLM captions are generated for chunks of the input source as configured by the chunk_duration (CHUNK SIZE in UI) and chunk_overlap_duration parameters.
A larger chunk size results in faster ingestion because fewer chunks need to be processed, while a smaller chunk size can produce more accurate captions and better detection of fast events.
The VLM prompt (PROMPT in UI) must be tuned for the use case to ensure accurate captions. The system_prompt parameter can be used to provide additional context or instructions to the VLM model. To enable reasoning with Cosmos Reason1, add <think></think> and <answer></answer> tags to the system prompt. Other VLM parameters, such as temperature, top_p, max_tokens, top_k, vlm_input_width, vlm_input_height, and num_frames_per_chunk, can be tuned for the accuracy / speed trade-off.
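For illustration, a chunking and VLM parameter set for a summarize request might look like the following sketch. The parameter names follow the descriptions above, but the field name prompt, the example values, and the exact request schema are assumptions to be verified against the /summarize API reference.

```python
# Hypothetical chunking and VLM parameters for a /summarize request; values are examples only.
summarize_params = {
    "chunk_duration": 60,          # CHUNK SIZE in the UI, in seconds
    "chunk_overlap_duration": 10,  # overlap between consecutive chunks, in seconds
    "prompt": "Describe forklift and worker activity in the scene.",  # assumed field name for the VLM prompt
    "system_prompt": "You are a warehouse safety analyst. Respond using <think></think> and <answer></answer> tags.",
    "temperature": 0.4,
    "top_p": 0.9,
    "top_k": 40,
    "max_tokens": 512,
    "vlm_input_width": 704,        # example VLM input resolution
    "vlm_input_height": 448,
    "num_frames_per_chunk": 10,
}
```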
Optionally, the CV pipeline (for Set-of-Marks prompting of the VLM) and audio transcription can be enabled. These require additional models to be configured and increase the compute requirements and ingestion latency. They can be enabled using:
For the Gradio UI, the Enable Audio checkbox, Enable CV checkbox and the CV Pipeline Prompt input
For the reference CLI, configure as part of Summarization Command
For the REST API, configure as part of /summarize [POST] endpoint
As soon as the ingestion pipeline generates captions for a chunk, the captions are passed to the retrieval pipeline along with each chunk’s metadata for indexing and summarization.
If alerts are enabled, the retrieval pipeline detects the configured events with the help of the configured LLM model. This results in a higher load on the LLM. Alerts can be configured via:
For the Gradio UI, go to the Alerts tab
For the reference CLI, configure as part of Summarization Command or use Add Live Stream Alert (Live Streams only)
For the REST API, configure as part of /summarize [POST] endpoint or use /alerts [POST] endpoint (Live Streams only)
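Continuing the earlier sketch, an alert on a live stream could be registered roughly as follows. The body fields (liveStreamId, name, events) and the response field id are assumptions; see the /alerts API reference for the exact schema.

```python
# Register an alert on a live stream (hypothetical body; field names are assumptions).
resp = requests.post(
    f"{VSS_BASE_URL}/alerts",
    json={
        "liveStreamId": stream_id,
        "name": "safety-violation",
        "events": ["worker without helmet", "forklift near pedestrian"],
    },
)
resp.raise_for_status()
alert_id = resp.json()["id"]  # assumed response field
```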
If chat is enabled, the retrieval pipeline, with the help of the configured LLM and embedding models, will ingest the VLM captions into the vector DB and/or the graph DB for Q&A. Various Q&A related parameters such as enable_chat, rag_batch_size, rag_top_k, chat_max_tokens, chat_temperature, and chat_top_p can be configured.
Using graph-ingestion and graph-retrieval (default) will result in higher Q&A accuracy at the cost of higher LLM load and higher latency.
Various chat related parameters can be configured using:
For the Gradio UI, the Enable Chat and Enable Chat History checkboxes in addition to the various parameters in the Parameters dialog.
For the reference CLI, use various chat related arguments with the Summarization Command
For the REST API, use various chat related parameters with /summarize [POST] endpoint
Summarization parameters can also be configured. These include summary_duration (SUMMARY DURATION in UI), the LLM summarization prompts caption_summarization_prompt and summary_aggregation_prompt, and the LLM sampling parameters summarize_batch_size, summarize_max_tokens, summarize_temperature, and summarize_top_p. Refer to Tuning Prompts for more details on the prompts. For files, the entire file is summarized at once. For live streams, summaries are generated every summary_duration seconds.
Various summarization related parameters can be configured using:
For the Gradio UI, the Parameters dialog and the SUMMARY DURATION input (Live Streams only)
For the reference CLI, use various summarization related arguments with the Summarization Command
For the REST API, use various summarization related parameters with /summarize [POST] endpoint
After all the parameters are configured, the summarize API can be called to start the summarization process.
For the Gradio UI, click on the Summarize button
For the reference CLI, use the Summarization Command
For the REST API, use /summarize [POST] endpoint
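Putting it together, a single /summarize call might combine the chunking, chat, alert, and summarization parameters described above. This is a sketch only, continuing the earlier snippets (file_id and summarize_params); the parameter names follow the descriptions in this section, but the exact request schema, defaults, and response shape should be verified against the API reference.

```python
resp = requests.post(
    f"{VSS_BASE_URL}/summarize",
    json={
        "id": file_id,                    # or a live-stream id (assumed field name)
        **summarize_params,               # chunking / VLM parameters from the earlier sketch
        # Chat (Q&A) related parameters
        "enable_chat": True,
        "rag_batch_size": 5,
        "rag_top_k": 5,
        "chat_max_tokens": 512,
        "chat_temperature": 0.2,
        "chat_top_p": 0.7,
        # Summarization related parameters
        "summary_duration": 300,          # SUMMARY DURATION in UI; used for live streams, in seconds
        "caption_summarization_prompt": "Combine the chunk captions into a concise event log.",
        "summary_aggregation_prompt": "Aggregate the event logs into a single summary of key incidents.",
        "summarize_batch_size": 5,
        "summarize_max_tokens": 1024,
        "summarize_temperature": 0.2,
        "summarize_top_p": 0.7,
    },
)
resp.raise_for_status()
print(resp.json())  # summary text or request acknowledgement (assumed response shape)
```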
VLM Captions Generation#
As an alternative to the full summarization pipeline, you can use VLM Captions Generation to generate detailed captions for video files or live streams without the overhead of summarization and Q&A capabilities. This is useful when you only need raw captions and don’t require the additional processing for alerts, chat, or summarization.
VLM captions generation can be performed using:
For the reference CLI, use Generate VLM Captions command
For the REST API, use /generate_vlm_captions [POST] endpoint
For live streams, VLM captions are generated in real-time and can be streamed using server-sent events. The captions provide detailed descriptions of each chunk of the video without the additional processing required for summarization.
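A rough sketch of requesting VLM captions over REST is shown below; for live streams the response can be consumed as a server-sent event stream. The body fields and the streaming behavior are assumptions, continuing the earlier snippets; check the /generate_vlm_captions reference for the exact schema.

```python
# Generate VLM captions for an ingested source (hypothetical body; fields are assumptions).
with requests.post(
    f"{VSS_BASE_URL}/generate_vlm_captions",
    json={"id": stream_id, "chunk_duration": 30, "prompt": "Describe the scene."},
    stream=True,  # consume server-sent events for live streams
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())  # one caption event per chunk
```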
Chat - Q&A#
If chat is enabled as part of summarization, Q&A can be performed after file summarization is complete or, for live streams, after at least one summary has been generated. For live streams, information about newer data from the stream is added to the graph or vector DB as summaries are generated.
Q&A can be performed using:
For the Gradio UI, the Chat tab
For the reference CLI, use the Chat Command
For the REST API, use /chat/completions [POST] endpoint
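Q&A over an ingested source can then be issued through the /chat/completions endpoint. The sketch below assumes an OpenAI-style messages array plus a source id field, continuing the earlier snippets; verify the exact request and response shapes against the API reference.

```python
resp = requests.post(
    f"{VSS_BASE_URL}/chat/completions",
    json={
        "id": file_id,  # source to chat about (assumed field name)
        "messages": [
            {"role": "user", "content": "Were any safety violations observed near the loading dock?"}
        ],
        "max_tokens": 512,
        "temperature": 0.2,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])  # assumed OpenAI-style response shape
```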
Alert Review#
VSS provides additional capabilities for alert review.
Alert review allows you to review external alerts using VLM analysis to generate a summary of the alert and determine if the alert is valid based on video content.
Alert review can be performed using:
For the reference CLI, use Review Alert command
For the REST API, use /reviewAlert [POST] endpoint
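An external alert could be submitted for review roughly as follows. The body fields and response shape are assumptions, continuing the earlier snippets; see the /reviewAlert reference for the exact schema.

```python
# Ask VSS to review an externally generated alert against the video (hypothetical body).
resp = requests.post(
    f"{VSS_BASE_URL}/reviewAlert",
    json={
        "id": file_id,  # assumed field: source to review the alert against
        "alert": {
            "name": "intrusion-detected",
            "description": "Motion detected in restricted area at 02:13",
        },
        "prompt": "Confirm whether a person actually entered the restricted area.",
    },
)
resp.raise_for_status()
print(resp.json())  # review summary and validity verdict (assumed response shape)
```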
Clean Up#
After a stream is no longer needed, it can be deleted using:
For the Gradio UI, click on the Delete button
For the reference CLI, use Delete File or Delete Live Stream command
For the REST API, use /files/{file_id} [DELETE] or /live-stream/{stream_id} [DELETE] endpoint
It is recommended to delete files and live streams after use to free up resources and prevent the graph DB from growing too large.
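Over REST, cleanup is a DELETE on the corresponding resource, reusing the ids obtained when the sources were added (continuing the earlier snippets):

```python
# Delete a file and a live stream once they are no longer needed.
requests.delete(f"{VSS_BASE_URL}/files/{file_id}").raise_for_status()
requests.delete(f"{VSS_BASE_URL}/live-stream/{stream_id}").raise_for_status()
```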
Multiple Streams and Concurrent Requests#
VSS supports multiple streams and concurrent requests. Use cases include processing multiple short video files or analyzing and processing multiple live camera feeds.
Clients can call any of the APIs, including Summarization (POST /summarize) and Q&A (POST /chat/completions), in parallel for different files or live streams from different threads or processes. Clients do not have to handle queuing or synchronization themselves: the VSS backend takes care of queuing and scheduling the requests. VSS, with the help of CA-RAG, also maintains a separate context for each source.
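For example, several files could be summarized in parallel from client threads, with VSS queuing and scheduling the requests internally. The helper below is a minimal sketch reusing the assumed request shape from the earlier snippets.

```python
from concurrent.futures import ThreadPoolExecutor

def summarize(file_id: str) -> dict:
    # Minimal per-file summarize call; parameters and schema as assumed in the earlier sketches.
    resp = requests.post(
        f"{VSS_BASE_URL}/summarize",
        json={"id": file_id, "chunk_duration": 60, "enable_chat": True},
    )
    resp.raise_for_status()
    return resp.json()

file_ids = ["file-1", "file-2", "file-3"]  # ids returned by /files for each uploaded video
with ThreadPoolExecutor(max_workers=3) as pool:
    summaries = list(pool.map(summarize, file_ids))
```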
For more details, refer to: