Semantic Montage Engine

Multi-Modal AI Pipeline for Automated Video Editing

A production-grade system that transforms raw video footage into import-ready NLE timelines using vision-language models (VLMs), automatic speech recognition (ASR), and large language model (LLM) reasoning. The pipeline performs semantic analysis of visual content, transcribes multilingual speech with word-level alignment, and uses chain-of-thought reasoning to generate context-aware edit sequences.

Key Features

Feature	Description
Automated Timeline Generation	Raw videos → FCPXML ready for DaVinci Resolve
Vector Video Database	Semantic search across all processed footage
Multi-Language ASR	Whisper-powered transcription (EN/HI/TE)
Visual Understanding	Qwen2-VL scene analysis with object/activity detection
LLM Reasoning	DeepSeek-R1 for context-aware sequencing
33:1 Compression	Hierarchical metadata compression for LLM context limits

Abstract

Traditional video editing requires significant manual effort to review footage, identify usable segments, and arrange clips into coherent sequences. This project presents an end-to-end automated pipeline that:

Extracts multi-modal features from raw video using Whisper ASR and the Qwen2-VL vision-language model
Compresses metadata through hierarchical summarization (97% size reduction) to fit LLM context limits
Generates edit decisions via DeepSeek-R1 reasoning model with script-awareness
Exports frame-accurate timelines in FCPXML 1.9 format for DaVinci Resolve
Enables semantic search via vector embeddings over all processed content

The system achieves a 33:1 compression ratio on metadata while preserving semantic fidelity, so that 15-20 videos can be processed within a single LLM context window (128K tokens).

Vector Video Database (Semantic Search)

Search your entire video library using natural language queries:

# Find clips with outdoor scenes
python src/main.py --input ./videos --search "outdoor balcony"

# Find clips with specific content
python src/main.py --input ./videos --search "person explaining something"

# Filter by content type
python src/main.py --input ./videos --search "talking" --search-type transcript
python src/main.py --input ./videos --search "walking" --search-type visual

How it works:

All processed metadata (transcripts, visual tags, summaries) is indexed using all-MiniLM-L6-v2 embeddings
Queries are matched using cosine similarity in 384-dimensional vector space
Returns ranked results with similarity scores (in practice between 0.0 and 1.0; higher means a closer match)
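
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names and the toy 3-dimensional vectors are assumptions standing in for the 384-dimensional all-MiniLM-L6-v2 embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, treat as no match
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, indexed, top_k=3):
    """Return (video_id, score) pairs sorted by descending similarity."""
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical index: toy vectors instead of real sentence embeddings.
index = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.6, 0.8, 0.0],
    "clip_c": [0.0, 0.0, 1.0],
}
results = rank_by_similarity([1.0, 0.0, 0.0], index)
for video_id, score in results:
    print(f"{video_id}\t{score:.3f}")
```

In a real index the vectors would come from the embedding model, and the same sort-by-score logic would produce the ranked table shown in the demo results.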

Demo results (query: "person explaining"):

Rank	Video	Score	Matched Content
1	20260107_181214	0.407	"man gesturing with hands as if speaking"
2	20260107_181231	0.396	"man gesturing with hands as if speaking"
9	20260107_180354	0.341	transcript: "AI understands that the..."

Search modes:

Semantic (default): Understands meaning—"person explaining" matches "man gesturing"
Keyword (fallback): Fast text matching when embeddings unavailable
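
To illustrate the difference between the two modes, here is a rough sketch of what a keyword fallback could look like. The scoring rule (fraction of query terms found in the text) and the function names are assumptions for illustration; the project's actual fallback may work differently.

```python
def keyword_score(query, text):
    """Fraction of query terms that appear as whole words in the text."""
    terms = query.lower().split()
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return sum(term in words for term in terms) / len(terms)

def keyword_search(query, documents):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    hits = [(doc_id, keyword_score(query, text)) for doc_id, text in documents.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits

# Hypothetical descriptions in the style of the visual tags.
docs = {
    "clip_a": "man gesturing with hands as if speaking",
    "clip_b": "person walking on an outdoor balcony",
}
print(keyword_search("outdoor balcony", docs))
print(keyword_search("person explaining", docs))
```

Note that the second query never reaches clip_a: plain text matching cannot connect "person explaining" with "man gesturing", which is the gap the semantic mode is meant to close.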

Use Cases

Domain	Application	Input	Output
Content Creation	YouTube/vlog automation	Talking-head + B-roll footage	Sequenced timeline with A/B-roll separation
Education	Lecture video assembly	Multi-take recordings + script	Script-aligned timeline with jump cuts removed
Documentary	Footage organization	Hours of unstructured clips	Thematically grouped sequences
Corporate	Training video production	Interview clips + bullet points	Narrative-ordered compilation
Multilingual	Code-switching content	Hindi-English mixed speech	Accurately transcribed, properly sequenced

Specific Scenarios

Vlogger Workflow: 20 short clips (2-30s each) → Single coherent video with automatic B-roll placement
Lecture Capture: Multiple takes of same content → Best take selection, false-start removal
Interview Editing: Long-form conversation → Topic-grouped segments with visual cutaways
Product Demo: Screen recordings + voiceover → Synchronized tutorial sequence

Technical Architecture

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           INGESTION LAYER                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 1] Video Ingestion                                                   │
│  ├── FFmpeg demuxing (H.264/H.265/VP9 → PCM audio)                         │
│  ├── Metadata extraction (duration, resolution, framerate)                  │
│  └── File hashing (SHA-256 for deduplication)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                           FEATURE EXTRACTION                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 2] Speech Recognition                                                │
│  ├── OpenAI Whisper (base: 74M params, medium: 769M params)                │
│  ├── Word-level forced alignment via stable-ts                             │
│  └── Language detection (en/hi/te with confidence scores)                  │
│                                                                             │
│  [Step 3] Visual Analysis                                                   │
│  ├── Keyframe extraction (scene-change detection, 1-3s intervals)          │
│  ├── Qwen2-VL-2B-Instruct (2B params, BFloat16 inference)                  │
│  └── Semantic tagging (activities, objects, composition, emotions)         │
├─────────────────────────────────────────────────────────────────────────────┤
│                           COMPRESSION LAYER                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 4]  Temporal Fusion     → Audio-visual timeline merge               │
│  [Step 4b] Lossless Compress   → Duplicate removal, field pruning          │
│  [Step 4c] LLM Summarization   → Narrative extraction via DeepSeek-R1      │
│  [Step 4d] Transcript Extract  → Dialogue-only filtering                   │
│  [Step 4e] Digest Creation     → Ultra-compressed (keywords + durations)   │
│                                                                             │
│  Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio)                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                           SEQUENCING LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 5a] LLM Reasoning                                                    │
│  ├── DeepSeek-R1 (671B MoE, chain-of-thought reasoning)                    │
│  ├── Script alignment (if provided)                                        │
│  ├── Dual-stage roll classification (A-roll/B-roll/mixed)                  │
│  └── Edit Decision List (EDL) generation                                   │
│                                                                             │
│  [Step 5b] Timeline Export                                                  │
│  ├── FCPXML 1.9 generation (rational time format: frames/framerate)        │
│  ├── Frame-accurate timecodes (1001/30000s precision for 29.97fps)         │
│  └── DaVinci Resolve 18+ compatible                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                           SEARCH LAYER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 6] Vector Video Database                                             │
│  ├── Index all processed metadata (transcripts, visual, summaries)         │
│  ├── Sentence-transformers embeddings (all-MiniLM-L6-v2)                   │
│  └── Natural language semantic search                                      │
└─────────────────────────────────────────────────────────────────────────────┘
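
The rational time format mentioned in Step 5b can be sketched as follows. FCPXML expresses times as fractions of a second with an `s` suffix, and at 29.97 fps one frame lasts 1001/30000 s. The helper names below are hypothetical, and this is not the project's exporter.

```python
from fractions import Fraction

# Duration of one frame at 29.97 fps, as stated in the pipeline overview.
NTSC_FRAME_DURATION = Fraction(1001, 30000)

def seconds_to_frames(seconds, frame_duration=NTSC_FRAME_DURATION):
    """Round a time in seconds to the nearest whole frame index."""
    return round(Fraction(seconds) / frame_duration)

def frames_to_rational(frames, frame_duration=NTSC_FRAME_DURATION):
    """Format a frame count as a rational time string on the frame timebase."""
    numerator = frames * frame_duration.numerator
    return f"{numerator}/{frame_duration.denominator}s"

# A cut requested at 10 seconds snaps to frame 300, slightly past 10 s.
cut_frame = seconds_to_frames(10)
print(cut_frame, frames_to_rational(cut_frame))
```

Working in whole frames and exact fractions avoids the drift that accumulates when 29.97 fps timecodes are computed with floating-point seconds.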

Model Stack

Component	Model	Parameters	Precision	VRAM	Inference
ASR	Whisper base	74M	FP32	1 GB	Local CPU/GPU
ASR	Whisper medium	769M	FP16	3 GB	Local GPU
VLM	Qwen2-VL-2B-Instruct	2B	BF16	6 GB	Local GPU
LLM	DeepSeek-R1	671B (MoE)	—	Cloud	NVIDIA API
LLM (alt)	GPT-4o	~200B (est.)	—	Cloud	OpenAI API
Embeddings	all-MiniLM-L6-v2	22M	FP32	0.1 GB	Local CPU

Key Innovations

1. Hierarchical Token Compression

Problem: Raw metadata (~35 KB per video) exceeds LLM context limits once multiple videos are combined
Solution: Progressive compression through successive summarization stages
Result: 15-20 videos fit in a single 128K-token context window
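
As an illustration of the first, non-LLM compression stage (duplicate removal and field pruning), consider the sketch below. The entry schema, the field names in `DROP_FIELDS`, and the merge rule are all assumptions made for the example; the real metadata layout is not shown in this README.

```python
import json

# Hypothetical fields that repeat identically on every entry.
DROP_FIELDS = {"model_name", "prompt"}

def compress_entries(entries):
    """Prune constant fields and merge consecutive entries with the same description."""
    compressed = []
    for entry in entries:
        slim = {k: v for k, v in entry.items() if k not in DROP_FIELDS}
        if compressed and compressed[-1]["description"] == slim["description"]:
            compressed[-1]["end"] = slim["end"]  # extend the previous time span
        else:
            compressed.append(slim)
    return compressed

entries = [
    {"start": 0.0, "end": 1.0, "description": "man at desk", "model_name": "vlm", "prompt": "Describe"},
    {"start": 1.0, "end": 2.0, "description": "man at desk", "model_name": "vlm", "prompt": "Describe"},
    {"start": 2.0, "end": 3.0, "description": "city skyline", "model_name": "vlm", "prompt": "Describe"},
]
compact = compress_entries(entries)
print(len(json.dumps(entries)), "->", len(json.dumps(compact)), "characters")
```

The later stages (LLM summarization, transcript extraction, digest creation) reduce the size further but depend on a model call, so they are not sketched here.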

2. Visual-Primary Timeline

Traditional NLE workflows are typically organized around the audio waveform as the primary timeline
Our approach: Keyframe semantic embeddings form the primary timeline, with speech as an overlay
This enables vision-first sequencing decisions
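
One way to picture the visual-primary layout is a list of visual segments with transcript words attached to them by timestamp. The sketch below assumes hypothetical `start`, `end`, `description`, and `text` fields; it shows the idea only and is not the project's temporal fusion code.

```python
def overlay_speech(segments, words):
    """Attach each word to the visual segment whose time span contains its start."""
    # Copy each segment so the source metadata is left untouched.
    timeline = [dict(seg, words=[]) for seg in segments]
    for word in words:
        for seg in timeline:
            if seg["start"] <= word["start"] < seg["end"]:
                seg["words"].append(word["text"])
                break
    return timeline

# Hypothetical keyframe-derived segments and word-level timestamps.
segments = [
    {"start": 0.0, "end": 2.0, "description": "man at desk"},
    {"start": 2.0, "end": 5.0, "description": "city skyline"},
]
words = [
    {"text": "hello", "start": 0.4},
    {"text": "world", "start": 1.1},
    {"text": "look", "start": 2.5},
]
for seg in overlay_speech(segments, words):
    print(seg["description"], "->", " ".join(seg["words"]))
```

Because the segments drive the structure, a clip with no speech at all still appears in the timeline, just with an empty word list.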

3. Dual-Stage Roll Classification

Stage 1: Per-video analysis → Initial A/B-roll classification
Stage 2: Cross-video context → Script-aware refinement
Result: Dynamic track assignment based on narrative context
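
The two stages can be illustrated with a deliberately simple heuristic. How the project actually makes these decisions is not shown in this README, so the speech-coverage thresholds, the word-overlap rule, and the function names below are invented for illustration only.

```python
def stage1_classify(speech_seconds, duration, a_roll_threshold=0.5, b_roll_threshold=0.1):
    """Stage 1 (per video): label a clip by how much of it contains speech."""
    coverage = speech_seconds / duration if duration > 0 else 0.0
    if coverage >= a_roll_threshold:
        return "A-roll"
    if coverage <= b_roll_threshold:
        return "B-roll"
    return "mixed"

def stage2_refine(label, transcript, script, min_overlap=0.5):
    """Stage 2 (cross-video): adjust spoken clips by how well they match the script."""
    if label == "B-roll" or not script:
        return label
    spoken = transcript.lower().split()
    if not spoken:
        return label
    script_words = set(script.lower().split())
    overlap = sum(word in script_words for word in spoken) / len(spoken)
    return "A-roll" if overlap >= min_overlap else "mixed"

script = "welcome to the lecture on graphs"
first_pass = stage1_classify(speech_seconds=3.0, duration=10.0)
print(first_pass, "->", stage2_refine(first_pass, "welcome to the lecture", script))
```

The point of the second stage is that the same clip can end up on a different track depending on the script: on-script speech is promoted to A-roll, while off-script speech stays flexible.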

4. Zero-Loss Source Preservation

Full metadata preserved at Step 4 (source of truth)
All compressions (4b-4e) are derived transformations
Original timeline always available for reconstruction

Quick Start

# 1. Setup environment (one-time)
.\setup.ps1                    # Windows (PowerShell)

# 2. Activate virtual environment
.\activate.ps1                 # Windows

# 3. Run pipeline
python src/main.py --input ./videos --script ./script.txt

# 4. Import output to DaVinci Resolve
# File → Import → Timeline → .AiEditor/project_timeline.fcpxml

System Requirements

Component	Minimum	Recommended
GPU	NVIDIA GTX 1060 (6GB)	RTX 3060+ (8GB+)
RAM	8 GB	16 GB
CUDA	11.8	12.x
Python	3.8	3.10+
Disk	20 GB	50 GB (for model cache)

CPU-only mode: Supported with --device cpu (3-5x slower)

Installation

# Clone repository
git clone <repo-url>
cd AiEditor

# Run setup script (creates venv, installs dependencies)
.\setup.ps1

# Configure API keys (optional, for cloud LLM)
$env:NVIDIA_API_KEY = "nvapi-xxx"   # For DeepSeek-R1
$env:OPENAI_API_KEY = "sk-xxx"      # For GPT-4o
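
Since both keys are optional, the pipeline has to decide at runtime which cloud LLM, if any, is available. A minimal sketch of such a check is shown below; the function name, the returned dictionary, and the `deepseek-r1` identifier are assumptions, while `gpt-4o` matches the `--llm-model` value used elsewhere in this README.

```python
import os

def pick_llm_provider(env=os.environ):
    """Choose a cloud LLM based on which API key is set in the environment."""
    if env.get("NVIDIA_API_KEY"):
        return {"provider": "nvidia", "model": "deepseek-r1"}  # hypothetical model id
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "model": "gpt-4o"}
    # No key configured: the caller must fall back or skip the LLM steps.
    return {"provider": None, "model": None}

print(pick_llm_provider())
```

Passing the environment as a parameter keeps the function easy to test with a plain dictionary instead of real credentials.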

Dependencies

Core packages installed via requirements.txt:

torch>=2.0 (CUDA 11.8+ wheels)
transformers>=4.45 (Qwen2-VL support)
openai-whisper (ASR)
opencv-python (video I/O)
openai>=1.0 (unified LLM client)
rich (CLI interface)

Full installation guide: INSTALLATION.md

Usage

Basic Commands

# Full pipeline with script guidance
python src/main.py --input ./videos --script ./script.txt

# Auto-sequence without script
python src/main.py --input ./videos

# Specific steps only
python src/main.py --input ./videos --step 2    # Transcription only
python src/main.py --input ./videos --step 3    # Visual analysis only
python src/main.py --input ./videos --step 5b   # Regenerate XML only

Advanced Options

# Whisper model selection (base/small/medium/large)
python src/main.py --input ./videos --whisper-model medium

# Force language (en/hi/te/auto)
python src/main.py --input ./videos --language en

# GPU selection
$env:CUDA_VISIBLE_DEVICES = "0"
python src/main.py --input ./videos

# Alternative LLM provider
python src/main.py --input ./videos --llm-model gpt-4o

# Verbose logging
python src/main.py --input ./videos --verbose
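
For orientation, the flags used throughout this README could be declared with `argparse` roughly as follows. This is a reconstruction of the visible CLI surface, not the contents of `src/main.py`; the defaults (`base`, `auto`) and help texts are assumptions.

```python
import argparse

def build_parser():
    """Declare the command-line flags that appear in this README."""
    parser = argparse.ArgumentParser(prog="main.py", description="Semantic Montage Engine")
    parser.add_argument("--input", required=True, help="folder containing raw videos")
    parser.add_argument("--script", help="optional script file to guide sequencing")
    parser.add_argument("--step", help="run a single step, e.g. 2, 3, or 5b")
    parser.add_argument("--whisper-model", default="base",
                        choices=["base", "small", "medium", "large"])
    parser.add_argument("--language", default="auto", choices=["en", "hi", "te", "auto"])
    parser.add_argument("--llm-model", help="alternative LLM, e.g. gpt-4o")
    parser.add_argument("--device", help="compute device, e.g. cpu")
    parser.add_argument("--search", help="natural language search query")
    parser.add_argument("--search-type", help="restrict search, e.g. transcript or visual")
    parser.add_argument("--verbose", action="store_true", help="enable verbose logging")
    return parser

args = build_parser().parse_args(["--input", "./videos", "--step", "5b",
                                  "--whisper-model", "medium"])
print(args.input, args.step, args.whisper_model, args.language)
```

Keeping `--step` as a free-form string accommodates sub-steps such as `4b` or `5b` without listing every stage in the parser.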

Output Structure

input_folder/
├── .AiEditor/                              # Pipeline outputs
│   ├── step1_summary.json                  # Ingestion metadata
│   ├── step2_summary.json                  # Transcription summary
│   ├── step3_summary.json                  # Visual analysis summary
│   ├── audio/                              # Extracted PCM audio
│   │   └── {video_id}.wav
│   ├── transcriptions/                     # Whisper outputs
│   │   └── {video_id}_transcription.json   # Word-level timestamps
│   ├── keyframes/                          # Extracted frames
│   │   └── {video_id}/                     # 1 folder per video
│   ├── visual_tags/                        # VLM analysis
│   │   └── {video_id}_tags.json            # Scene descriptions
│   ├── final_metadata/                     # Full timeline (~35KB/video)
│   ├── compressed_metadata/                # Lossless compression (~7KB)
│   ├── llm_summaries/                      # DeepSeek summaries (~2KB)
│   ├── digest_metadata/                    # Ultra-compressed (~1KB)
│   ├── edl/                                # Edit decisions
│   │   └── edit_decision_list.json
│   └── project_timeline.fcpxml             # ← IMPORT THIS
└── [original videos unchanged]
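
When scripting against these outputs, it can help to build the paths in one place. The helper below is a small sketch based only on the layout shown above; the function name and dictionary keys are hypothetical.

```python
from pathlib import Path

def output_paths(input_folder, video_id):
    """Map a video ID to the per-video and project-level output locations."""
    root = Path(input_folder) / ".AiEditor"
    return {
        "audio": root / "audio" / f"{video_id}.wav",
        "transcription": root / "transcriptions" / f"{video_id}_transcription.json",
        "keyframes": root / "keyframes" / video_id,
        "visual_tags": root / "visual_tags" / f"{video_id}_tags.json",
        "edl": root / "edl" / "edit_decision_list.json",
        "timeline": root / "project_timeline.fcpxml",
    }

paths = output_paths("videos", "20260107_181214")
for name, path in paths.items():
    print(f"{name:14s} {path.as_posix()}")
```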

Performance

Tested on: 19 videos (343 seconds of footage in total), RTX 3060 12GB

Step	Time	Notes
1-2	2-3 min	Audio + transcription
3	8-12 min	Visual analysis (bottleneck)
4-5b	~1 min	Compression + LLM + export
Total	~15 min	End-to-end

Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio)
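
The two figures in the compression line are related by simple arithmetic, which the snippet below reproduces from the rounded sizes. The function name is hypothetical.

```python
def compression_stats(before_kb, after_kb):
    """Return (ratio, fractional reduction) for a before/after size pair."""
    ratio = before_kb / after_kb
    reduction = 1 - after_kb / before_kb
    return ratio, reduction

ratio, reduction = compression_stats(785, 23)
print(f"ratio ~ {ratio:.1f}:1, reduction ~ {reduction:.1%}")
```

With the rounded sizes this gives a ratio of about 34:1 and a reduction of about 97.1%; the reported 33:1 presumably reflects the unrounded byte counts.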

Documentation

Document	Description
INSTALLATION.md	Setup instructions
PIPELINE.md	Pipeline documentation
ARCHITECTURE.md	System diagrams
USAGE.md	CLI reference

Limitations

GPU: Qwen2-VL requires at least 6 GB of VRAM
LLM: A cloud API is needed for the best sequencing quality
Audio: Low-quality audio degrades ASR accuracy

Future Work

B-roll matching via CLIP similarity
Background music with auto-ducking
Multi-track output (V1, V2, A1, A2)

License

MIT License - See LICENSE

IIT Ropar | End Semester Project | January 2026
