Multi-Modal AI Pipeline for Automated Video Editing
A production-grade system that transforms raw video footage into import-ready NLE timelines using vision-language models (VLMs), automatic speech recognition (ASR), and large language model (LLM) reasoning. The pipeline performs semantic analysis of visual content, transcribes multilingual speech with word-level alignment, and uses chain-of-thought reasoning to generate context-aware edit sequences.
| Feature | Description |
|---|---|
| Automated Timeline Generation | Raw videos → FCPXML ready for DaVinci Resolve |
| Vector Video Database | Semantic search across all processed footage |
| Multi-Language ASR | Whisper-powered transcription (EN/HI/TE) |
| Visual Understanding | Qwen2-VL scene analysis with object/activity detection |
| LLM Reasoning | DeepSeek-R1 for context-aware sequencing |
| 33:1 Compression | Hierarchical metadata compression for LLM context limits |
Traditional video editing requires significant manual effort to review footage, identify usable segments, and arrange clips into coherent sequences. This project presents an end-to-end automated pipeline that:
- Extracts multi-modal features from raw video using Whisper ASR and the Qwen2-VL vision-language model
- Compresses metadata through hierarchical summarization (97% size reduction) to fit LLM context limits
- Generates edit decisions via DeepSeek-R1 reasoning model with script-awareness
- Exports frame-accurate timelines in FCPXML 1.9 format for DaVinci Resolve
- Enables semantic search via vector embeddings over all processed content
The system achieves a roughly 33:1 compression ratio on metadata while preserving the semantic content needed for sequencing, so 15-20 videos fit within a single LLM context window (128K tokens).
Search your entire video library using natural language queries:
# Find clips with outdoor scenes
python src/main.py --input ./videos --search "outdoor balcony"
# Find clips with specific content
python src/main.py --input ./videos --search "person explaining something"
# Filter by content type
python src/main.py --input ./videos --search "talking" --search-type transcript
python src/main.py --input ./videos --search "walking" --search-type visual

How it works:
- All processed metadata (transcripts, visual tags, summaries) is indexed using all-MiniLM-L6-v2 embeddings
- Queries are matched using cosine similarity in 384-dimensional vector space
- Returns ranked results with similarity scores (0.0–1.0)
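The ranking step described above can be illustrated with a minimal sketch. It uses toy 3-dimensional vectors in place of the 384-dimensional all-MiniLM-L6-v2 embeddings, and the function names (`cosine_similarity`, `rank_by_similarity`) are hypothetical, not taken from the project source.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, index, top_k=3):
    """Return (video_id, score) pairs sorted by descending similarity."""
    scored = [(vid, cosine_similarity(query_vec, vec)) for vid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real embeddings (assumption for illustration).
toy_index = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.9, 0.0],
    "clip_c": [0.7, 0.7, 0.0],
}
results = rank_by_similarity([1.0, 0.0, 0.0], toy_index)
for video_id, score in results:
    print(f"{video_id}: {score:.3f}")
```

In a real index the vectors would come from the embedding model, and the scan over every entry could be replaced by a vector store once the library grows.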
Demo results (query: "person explaining"):
| Rank | Video | Score | Matched Content |
|---|---|---|---|
| 1 | 20260107_181214 | 0.407 | "man gesturing with hands as if speaking" |
| 2 | 20260107_181231 | 0.396 | "man gesturing with hands as if speaking" |
| 9 | 20260107_180354 | 0.341 | transcript: "AI understands that the..." |
Search modes:
- Semantic (default): Understands meaning—"person explaining" matches "man gesturing"
- Keyword (fallback): Fast text matching when embeddings unavailable
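To make the difference between the two modes concrete, here is a hedged sketch of what a keyword fallback could look like. The project's actual fallback is not shown in this README, so `keyword_score` and `keyword_search` are hypothetical and the scoring rule (fraction of query terms found) is an assumption.

```python
def keyword_score(query, text):
    """Fraction of distinct query terms that appear as whole words in text."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return len(terms & words) / len(terms)

def keyword_search(query, documents):
    """Return (doc_id, score) pairs with a non-zero score, best first."""
    scored = [(doc_id, keyword_score(query, text)) for doc_id, text in documents.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return hits

docs = {
    "v1": "man gesturing with hands as if speaking",
    "v2": "person explaining the pipeline",
}
hits = keyword_search("person explaining", docs)
print(hits)
```

Here only `v2` is returned: plain text matching cannot connect "person explaining" to "man gesturing", which is the gap the semantic mode is meant to close.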
| Domain | Application | Input | Output |
|---|---|---|---|
| Content Creation | YouTube/vlog automation | Talking-head + B-roll footage | Sequenced timeline with A/B-roll separation |
| Education | Lecture video assembly | Multi-take recordings + script | Script-aligned timeline with jump cuts removed |
| Documentary | Footage organization | Hours of unstructured clips | Thematically grouped sequences |
| Corporate | Training video production | Interview clips + bullet points | Narrative-ordered compilation |
| Multilingual | Code-switching content | Hindi-English mixed speech | Accurately transcribed, properly sequenced |
- Vlogger Workflow: 20 short clips (2-30s each) → Single coherent video with automatic B-roll placement
- Lecture Capture: Multiple takes of same content → Best take selection, false-start removal
- Interview Editing: Long-form conversation → Topic-grouped segments with visual cutaways
- Product Demo: Screen recordings + voiceover → Synchronized tutorial sequence
┌─────────────────────────────────────────────────────────────────────────────┐
│ INGESTION LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ [Step 1] Video Ingestion │
│ ├── FFmpeg demuxing (H.264/H.265/VP9 → PCM audio) │
│ ├── Metadata extraction (duration, resolution, framerate) │
│ └── File hashing (SHA-256 for deduplication) │
├─────────────────────────────────────────────────────────────────────────────┤
│ FEATURE EXTRACTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ [Step 2] Speech Recognition │
│ ├── OpenAI Whisper (base: 74M params, medium: 769M params) │
│ ├── Word-level forced alignment via stable-ts │
│ └── Language detection (en/hi/te with confidence scores) │
│ │
│ [Step 3] Visual Analysis │
│ ├── Keyframe extraction (scene-change detection, 1-3s intervals) │
│ ├── Qwen2-VL-2B-Instruct (2B params, BFloat16 inference) │
│ └── Semantic tagging (activities, objects, composition, emotions) │
├─────────────────────────────────────────────────────────────────────────────┤
│ COMPRESSION LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ [Step 4] Temporal Fusion → Audio-visual timeline merge │
│ [Step 4b] Lossless Compress → Duplicate removal, field pruning │
│ [Step 4c] LLM Summarization → Narrative extraction via DeepSeek-R1 │
│ [Step 4d] Transcript Extract → Dialogue-only filtering │
│ [Step 4e] Digest Creation → Ultra-compressed (keywords + durations) │
│ │
│ Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio) │
├─────────────────────────────────────────────────────────────────────────────┤
│ SEQUENCING LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ [Step 5a] LLM Reasoning │
│ ├── DeepSeek-R1 (671B MoE, chain-of-thought reasoning) │
│ ├── Script alignment (if provided) │
│ ├── Dual-stage roll classification (A-roll/B-roll/mixed) │
│ └── Edit Decision List (EDL) generation │
│ │
│ [Step 5b] Timeline Export │
│ ├── FCPXML 1.9 generation (rational time format: frames/framerate) │
│ ├── Frame-accurate timecodes (1001/30000s precision for 29.97fps) │
│ └── DaVinci Resolve 18+ compatible │
├─────────────────────────────────────────────────────────────────────────────┤
│ SEARCH LAYER │
├─────────────────────────────────────────────────────────────────────────────┤
│ [Step 6] Vector Video Database │
│ ├── Index all processed metadata (transcripts, visual, summaries) │
│ ├── Sentence-transformers embeddings (all-MiniLM-L6-v2) │
│ └── Natural language semantic search │
└─────────────────────────────────────────────────────────────────────────────┘
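The 1001/30000s frame duration quoted in Step 5b can be sketched as follows. FCPXML writes times as rational numbers of seconds, so a timestamp is first snapped to a whole frame and then expressed over the 30000 timescale. The helper names are hypothetical, and the project's exporter may format values differently.

```python
from fractions import Fraction

# At 29.97 fps (NTSC), each frame lasts exactly 1001/30000 of a second.
FRAME_DURATION = Fraction(1001, 30000)

def seconds_to_frames(seconds):
    """Snap a timestamp in seconds to the nearest whole frame index."""
    return round(Fraction(seconds) / FRAME_DURATION)

def frames_to_fcpxml_time(frames):
    """Format a frame count as a rational time string over timescale 30000."""
    numerator = frames * FRAME_DURATION.numerator
    return f"{numerator}/{FRAME_DURATION.denominator}s"

start = frames_to_fcpxml_time(seconds_to_frames(3.0))
print(start)
```

Keeping times as integer frame multiples avoids the drift that accumulates when 29.97 fps timestamps are stored as floating-point seconds.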
| Component | Model | Parameters | Precision | VRAM | Inference |
|---|---|---|---|---|---|
| ASR | Whisper base | 74M | FP32 | 1 GB | Local CPU/GPU |
| ASR | Whisper medium | 769M | FP16 | 3 GB | Local GPU |
| VLM | Qwen2-VL-2B-Instruct | 2B | BF16 | 6 GB | Local GPU |
| LLM | DeepSeek-R1 | 671B (MoE) | — | Cloud | NVIDIA API |
| LLM (alt) | GPT-4o | ~200B (est.) | — | Cloud | OpenAI API |
| Embeddings | all-MiniLM-L6-v2 | 22M | FP32 | 0.1 GB | Local CPU |
1. Hierarchical Token Compression
- Problem: Raw metadata (~35 KB/video) exceeds LLM context limits once a project spans many clips
- Solution: Progressive compression through summarization stages
- Result: 15-20 videos fit in single 128K context window
2. Visual-Primary Timeline
- Traditional NLEs use audio waveforms as primary timeline
- Our approach: Keyframe semantic embeddings as primary, speech as overlay
- Enables vision-first sequencing decisions
3. Dual-Stage Roll Classification
- Stage 1: Per-video analysis → Initial A/B-roll classification
- Stage 2: Cross-video context → Script-aware refinement
- Result: Dynamic track assignment based on narrative context
4. Zero-Loss Source Preservation
- Full metadata preserved at Step 4 (source of truth)
- All compressions (4b-4e) are derived transformations
- Original timeline always available for reconstruction
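The trade-off behind the hierarchical compression in item 1 can be sketched as a budget check: pick the most detailed metadata tier that still fits the context window. The per-video sizes are the approximate figures listed for the output folders. The 4-bytes-per-token conversion and the helper names are assumptions for illustration, and a real implementation would count tokens with the target model's tokenizer.

```python
# Approximate per-video sizes in KB, ordered from most to least detailed.
TIER_SIZES_KB = [
    ("final_metadata", 35),
    ("compressed_metadata", 7),
    ("llm_summaries", 2),
    ("digest_metadata", 1),
]
BYTES_PER_TOKEN = 4  # assumption: rough heuristic, not a measured value

def estimate_tokens(num_videos, size_kb):
    """Estimate the token count for num_videos files of size_kb each."""
    return num_videos * size_kb * 1024 // BYTES_PER_TOKEN

def pick_tier(num_videos, token_budget=128_000):
    """Return the most detailed tier whose estimated size fits the budget."""
    for name, size_kb in TIER_SIZES_KB:
        if estimate_tokens(num_videos, size_kb) <= token_budget:
            return name
    return None  # nothing fits, even the digest

tier = pick_tier(19)
print(tier, estimate_tokens(19, 35))
```

Under these assumptions, 19 videos of full metadata come to about 170K tokens, over the 128K budget, while the losslessly compressed tier fits comfortably.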
# 1. Setup environment (one-time)
.\setup.ps1 # Windows (PowerShell)
# 2. Activate virtual environment
.\activate.ps1 # Windows
# 3. Run pipeline
python src/main.py --input ./videos --script ./script.txt
# 4. Import output to DaVinci Resolve
# File → Import → Timeline → .AiEditor/project_timeline.fcpxml

| Component | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA GTX 1060 (6GB) | RTX 3060+ (8GB+) |
| RAM | 8 GB | 16 GB |
| CUDA | 11.8 | 12.x |
| Python | 3.8 | 3.10+ |
| Disk | 20 GB | 50 GB (for model cache) |
CPU-only mode: Supported with --device cpu (3-5x slower)
# Clone repository
git clone <repo-url> && cd AiEditor
# Run setup script (creates venv, installs dependencies)
.\setup.ps1
# Configure API keys (optional, for cloud LLM)
$env:NVIDIA_API_KEY = "nvapi-xxx" # For DeepSeek-R1
$env:OPENAI_API_KEY = "sk-xxx" # For GPT-4o

Core packages installed via requirements.txt:
- torch>=2.0 (CUDA 11.8+ wheels)
- transformers>=4.40 (Qwen2-VL support)
- openai-whisper (ASR)
- opencv-python (video I/O)
- openai>=1.0 (unified LLM client)
- rich (CLI interface)
Full installation guide: INSTALLATION.md
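As a rough illustration of how the two API keys might map onto a single OpenAI-compatible client, the sketch below resolves provider settings from environment variables. The provider table, base URLs, model identifiers, and the `resolve_llm_config` helper are all assumptions for illustration. Check them against each provider's documentation and the project's own configuration code.

```python
import os

# Assumed provider table: URLs and model ids are illustrative, not confirmed
# by this README.
PROVIDERS = {
    "deepseek-r1": {
        "env_var": "NVIDIA_API_KEY",
        "base_url": "https://integrate.api.nvidia.com/v1",
        "model": "deepseek-ai/deepseek-r1",
    },
    "gpt-4o": {
        "env_var": "OPENAI_API_KEY",
        "base_url": "https://api.openai.com/v1",
        "model": "gpt-4o",
    },
}

def resolve_llm_config(name, env=None):
    """Return client settings for a provider, or None if its key is unset."""
    env = os.environ if env is None else env
    provider = PROVIDERS[name]
    api_key = env.get(provider["env_var"])
    if not api_key:
        return None
    return {
        "api_key": api_key,
        "base_url": provider["base_url"],
        "model": provider["model"],
    }

config = resolve_llm_config("deepseek-r1", env={"NVIDIA_API_KEY": "nvapi-xxx"})
print(config["base_url"], config["model"])
```

The `api_key` and `base_url` values could then be passed to `openai.OpenAI(api_key=..., base_url=...)`, which is how a single client from `openai>=1.0` can talk to either endpoint.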
# Full pipeline with script guidance
python src/main.py --input ./videos --script ./script.txt
# Auto-sequence without script
python src/main.py --input ./videos
# Specific steps only
python src/main.py --input ./videos --step 2 # Transcription only
python src/main.py --input ./videos --step 3 # Visual analysis only
python src/main.py --input ./videos --step 5b # Regenerate XML only

# Whisper model selection (base/small/medium/large)
python src/main.py --input ./videos --whisper-model medium
# Force language (en/hi/te/auto)
python src/main.py --input ./videos --language en
# GPU selection
$env:CUDA_VISIBLE_DEVICES = "0"
python src/main.py --input ./videos
# Alternative LLM provider
python src/main.py --input ./videos --llm-model gpt-4o
# Verbose logging
python src/main.py --input ./videos --verbose

input_folder/
├── .AiEditor/ # Pipeline outputs
│ ├── step1_summary.json # Ingestion metadata
│ ├── step2_summary.json # Transcription summary
│ ├── step3_summary.json # Visual analysis summary
│ ├── audio/ # Extracted PCM audio
│ │ └── {video_id}.wav
│ ├── transcriptions/ # Whisper outputs
│ │ └── {video_id}_transcription.json # Word-level timestamps
│ ├── keyframes/ # Extracted frames
│ │ └── {video_id}/ # 1 folder per video
│ ├── visual_tags/ # VLM analysis
│ │ └── {video_id}_tags.json # Scene descriptions
│ ├── final_metadata/ # Full timeline (~35KB/video)
│ ├── compressed_metadata/ # Lossless compression (~7KB)
│ ├── llm_summaries/ # DeepSeek summaries (~2KB)
│ ├── digest_metadata/ # Ultra-compressed (~1KB)
│ ├── edl/ # Edit decisions
│ │ └── edit_decision_list.json
│ └── project_timeline.fcpxml # ← IMPORT THIS
└── [original videos unchanged]
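For scripts that post-process the pipeline outputs, the layout above can be captured in a small path helper. This is a sketch based only on the directory tree shown. `output_paths` is a hypothetical function, not part of the project's API.

```python
from pathlib import Path

def output_paths(input_folder, video_id):
    """Map a video id to its per-video artifacts under .AiEditor/."""
    root = Path(input_folder) / ".AiEditor"
    return {
        "audio": root / "audio" / f"{video_id}.wav",
        "transcription": root / "transcriptions" / f"{video_id}_transcription.json",
        "keyframes": root / "keyframes" / video_id,
        "visual_tags": root / "visual_tags" / f"{video_id}_tags.json",
        "timeline": root / "project_timeline.fcpxml",  # shared by all videos
    }

paths = output_paths("videos", "20260107_181214")
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```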
Tested on: 19 videos, 343 seconds, RTX 3060 12GB
| Step | Time | Notes |
|---|---|---|
| 1-2 | 2-3 min | Audio + transcription |
| 3 | 8-12 min | Visual analysis (bottleneck) |
| 4-5b | ~1 min | Compression + LLM + export |
| Total | ~15 min | End-to-end |
Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio)
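The headline figures can be reproduced from the reported numbers with a few lines of arithmetic. The sketch below uses only values stated in this section. The function name is hypothetical, and the processing-time ratio assumes that the 343 seconds refers to total footage duration.

```python
def compression_stats(original_kb, compressed_kb):
    """Return (ratio, percent reduction) for a before/after size pair."""
    ratio = original_kb / compressed_kb
    reduction_pct = (1 - compressed_kb / original_kb) * 100
    return ratio, reduction_pct

ratio, reduction_pct = compression_stats(785, 23)

# Processing time relative to footage length: ~15 min for 343 s of video.
realtime_factor = (15 * 60) / 343

print(f"ratio {ratio:.1f}:1, reduction {reduction_pct:.1f}%")
print(f"processing takes {realtime_factor:.2f}x the footage duration")
```

With the rounded file sizes the ratio comes out near 34:1, close to the quoted 33:1. The small gap is presumably due to rounding of the reported sizes.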
| Document | Description |
|---|---|
| INSTALLATION.md | Setup instructions |
| PIPELINE.md | Pipeline documentation |
| ARCHITECTURE.md | System diagrams |
| USAGE.md | CLI reference |
- GPU: Qwen2-VL requires 6GB VRAM minimum
- LLM: Cloud API needed for best sequencing
- Audio: Low-quality audio degrades ASR accuracy
- B-roll matching via CLIP similarity
- Background music with auto-ducking
- Multi-track output (V1, V2, A1, A2)
MIT License - See LICENSE
IIT Ropar | End Semester Project | January 2026