AHSharan/IIT-Ropar-End-SEM-Project

Semantic Montage Engine

Multi-Modal AI Pipeline for Automated Video Editing

A production-grade system that transforms raw video footage into import-ready non-linear editor (NLE) timelines using vision-language models (VLMs), automatic speech recognition (ASR), and large language model (LLM) reasoning. The pipeline analyzes visual content semantically, transcribes multilingual speech with word-level alignment, and uses chain-of-thought reasoning to generate context-aware edit sequences.

Python 3.8+ | CUDA 11.8+ | License: MIT


Key Features

| Feature | Description |
| --- | --- |
| Automated Timeline Generation | Raw videos → FCPXML ready for DaVinci Resolve |
| Vector Video Database | Semantic search across all processed footage |
| Multi-Language ASR | Whisper-powered transcription (EN/HI/TE) |
| Visual Understanding | Qwen2-VL scene analysis with object/activity detection |
| LLM Reasoning | DeepSeek-R1 for context-aware sequencing |
| 33:1 Compression | Hierarchical metadata compression for LLM context limits |

Abstract

Traditional video editing requires significant manual effort to review footage, identify usable segments, and arrange clips into coherent sequences. This project presents an end-to-end automated pipeline that:

  1. Extracts multi-modal features from raw video using Whisper ASR and the Qwen2-VL vision-language model
  2. Compresses metadata through hierarchical summarization (97% size reduction) to fit LLM context limits
  3. Generates edit decisions with the DeepSeek-R1 reasoning model, guided by a script when one is provided
  4. Exports frame-accurate timelines in FCPXML 1.9 format for DaVinci Resolve
  5. Enables semantic search via vector embeddings over all processed content

The system achieves a compression ratio of roughly 33:1 on metadata while preserving semantic fidelity, so that 15-20 videos can be processed within a single LLM context window (128K tokens).


Vector Video Database (Semantic Search)

Search your entire video library using natural language queries:

# Find clips with outdoor scenes
python src/main.py --input ./videos --search "outdoor balcony"

# Find clips with specific content
python src/main.py --input ./videos --search "person explaining something"

# Filter by content type
python src/main.py --input ./videos --search "talking" --search-type transcript
python src/main.py --input ./videos --search "walking" --search-type visual

How it works:

  1. All processed metadata (transcripts, visual tags, summaries) is indexed using all-MiniLM-L6-v2 embeddings
  2. Queries are matched using cosine similarity in 384-dimensional vector space
  3. Results are returned in ranked order with their cosine similarity scores (higher means more similar; 1.0 is the maximum)
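
The ranking step above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the function names are hypothetical, and the toy 3-dimensional vectors stand in for the 384-dimensional all-MiniLM-L6-v2 embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, indexed, top_k=3):
    """Return the top_k (item_id, score) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), item_id)
              for item_id, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item_id, score) for score, item_id in scored[:top_k]]

# Toy index: in the real pipeline each vector would be an embedding of a
# transcript, visual tag, or summary.
index = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.1, 0.9, 0.0],
    "clip_c": [0.7, 0.3, 0.1],
}
results = rank_by_similarity([1.0, 0.0, 0.0], index, top_k=2)
```

With these toy vectors, `clip_a` ranks first and `clip_c` second, because their directions are closest to the query vector.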

Demo results (query: "person explaining"):

| Rank | Video | Score | Matched Content |
| --- | --- | --- | --- |
| 1 | 20260107_181214 | 0.407 | "man gesturing with hands as if speaking" |
| 2 | 20260107_181231 | 0.396 | "man gesturing with hands as if speaking" |
| 9 | 20260107_180354 | 0.341 | transcript: "AI understands that the..." |

Search modes:

  • Semantic (default): Matches on meaning, so "person explaining" matches "man gesturing"
  • Keyword (fallback): Fast text matching when embeddings are unavailable
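
To show why the two modes behave differently, here is a rough sketch of what a keyword fallback could look like. The scoring rule (fraction of query terms found in the text) and all function names are assumptions made for illustration; the project's actual fallback may differ.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_score(query, document):
    """Fraction of query terms that appear in the document (0.0 to 1.0)."""
    query_terms = tokenize(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokenize(document)) / len(query_terms)

def keyword_search(query, documents, top_k=5):
    """Return up to top_k (doc_id, score) pairs with a non-zero score."""
    scored = [(keyword_score(query, text), doc_id)
              for doc_id, text in documents.items()]
    # Sort by descending score, then by id so ties are deterministic.
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in ordered if score > 0][:top_k]

docs = {
    "v1": "man gesturing with hands as if speaking",
    "v2": "outdoor balcony with plants",
}
exact_hits = keyword_search("outdoor balcony", docs)
paraphrase_hits = keyword_search("person explaining", docs)
```

Here `exact_hits` contains `v2`, while `paraphrase_hits` is empty: plain term matching cannot connect "person explaining" to "man gesturing", which is the gap the semantic mode covers.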

Use Cases

| Domain | Application | Input | Output |
| --- | --- | --- | --- |
| Content Creation | YouTube/vlog automation | Talking-head + B-roll footage | Sequenced timeline with A/B-roll separation |
| Education | Lecture video assembly | Multi-take recordings + script | Script-aligned timeline with jump cuts removed |
| Documentary | Footage organization | Hours of unstructured clips | Thematically grouped sequences |
| Corporate | Training video production | Interview clips + bullet points | Narrative-ordered compilation |
| Multilingual | Code-switching content | Hindi-English mixed speech | Accurately transcribed, properly sequenced |

Specific Scenarios

  • Vlogger Workflow: 20 short clips (2-30s each) → Single coherent video with automatic B-roll placement
  • Lecture Capture: Multiple takes of same content → Best take selection, false-start removal
  • Interview Editing: Long-form conversation → Topic-grouped segments with visual cutaways
  • Product Demo: Screen recordings + voiceover → Synchronized tutorial sequence

Technical Architecture

Pipeline Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│                           INGESTION LAYER                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 1] Video Ingestion                                                   │
│  ├── FFmpeg demuxing (H.264/H.265/VP9 → PCM audio)                         │
│  ├── Metadata extraction (duration, resolution, framerate)                  │
│  └── File hashing (SHA-256 for deduplication)                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                           FEATURE EXTRACTION                                │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 2] Speech Recognition                                                │
│  ├── OpenAI Whisper (base: 74M params, medium: 769M params)                │
│  ├── Word-level forced alignment via stable-ts                             │
│  └── Language detection (en/hi/te with confidence scores)                  │
│                                                                             │
│  [Step 3] Visual Analysis                                                   │
│  ├── Keyframe extraction (scene-change detection, 1-3s intervals)          │
│  ├── Qwen2-VL-2B-Instruct (2B params, BFloat16 inference)                  │
│  └── Semantic tagging (activities, objects, composition, emotions)         │
├─────────────────────────────────────────────────────────────────────────────┤
│                           COMPRESSION LAYER                                 │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 4]  Temporal Fusion     → Audio-visual timeline merge               │
│  [Step 4b] Lossless Compress   → Duplicate removal, field pruning          │
│  [Step 4c] LLM Summarization   → Narrative extraction via DeepSeek-R1      │
│  [Step 4d] Transcript Extract  → Dialogue-only filtering                   │
│  [Step 4e] Digest Creation     → Ultra-compressed (keywords + durations)   │
│                                                                             │
│  Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio)                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                           SEQUENCING LAYER                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 5a] LLM Reasoning                                                    │
│  ├── DeepSeek-R1 (671B MoE, chain-of-thought reasoning)                    │
│  ├── Script alignment (if provided)                                        │
│  ├── Dual-stage roll classification (A-roll/B-roll/mixed)                  │
│  └── Edit Decision List (EDL) generation                                   │
│                                                                             │
│  [Step 5b] Timeline Export                                                  │
│  ├── FCPXML 1.9 generation (rational time format: frames/framerate)        │
│  ├── Frame-accurate timecodes (1001/30000s precision for 29.97fps)         │
│  └── DaVinci Resolve 18+ compatible                                        │
├─────────────────────────────────────────────────────────────────────────────┤
│                           SEARCH LAYER                                      │
├─────────────────────────────────────────────────────────────────────────────┤
│  [Step 6] Vector Video Database                                             │
│  ├── Index all processed metadata (transcripts, visual, summaries)         │
│  ├── Sentence-transformers embeddings (all-MiniLM-L6-v2)                   │
│  └── Natural language semantic search                                      │
└─────────────────────────────────────────────────────────────────────────────┘
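
Step 5b mentions rational time values, with one frame at 29.97 fps lasting 1001/30000 s. The sketch below shows how frame counts can be mapped to such values without floating-point drift. The function names are hypothetical, and keeping the timebase as a fixed denominator is just one common convention, not necessarily what this exporter does.

```python
from fractions import Fraction

def frames_to_fcpxml_time(frames, fps_num=30000, fps_den=1001):
    """Format a frame count as a rational-seconds string.

    The frame rate is fps_num / fps_den (30000/1001 is 29.97 fps), so one
    frame lasts fps_den / fps_num seconds.
    """
    if frames == 0:
        return "0s"
    # Keep the timebase as the denominator instead of reducing the fraction.
    return f"{frames * fps_den}/{fps_num}s"

def seconds_to_frames(seconds, fps_num=30000, fps_den=1001):
    """Snap a time in seconds to the nearest whole frame using exact arithmetic."""
    return round(Fraction(seconds) * Fraction(fps_num, fps_den))

one_frame = frames_to_fcpxml_time(1)      # duration of a single frame
clip_start = frames_to_fcpxml_time(30)    # offset of frame 30
snapped = seconds_to_frames(10.01)        # nearest frame to 10.01 s
```

Working in whole frames and converting only at export time keeps every cut on a frame boundary, which is what "frame-accurate" requires.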

Model Stack

| Component | Model | Parameters | Precision | VRAM | Inference |
| --- | --- | --- | --- | --- | --- |
| ASR | Whisper base | 74M | FP32 | 1 GB | Local CPU/GPU |
| ASR | Whisper medium | 769M | FP16 | 3 GB | Local GPU |
| VLM | Qwen2-VL-2B-Instruct | 2B | BF16 | 6 GB | Local GPU |
| LLM | DeepSeek-R1 | 671B (MoE) | - | - | Cloud (NVIDIA API) |
| LLM (alt) | GPT-4o | ~200B (est.) | - | - | Cloud (OpenAI API) |
| Embeddings | all-MiniLM-L6-v2 | 22M | FP32 | 0.1 GB | Local CPU |

Key Innovations

1. Hierarchical Token Compression

  • Problem: Raw metadata (~35 KB per video) exceeds LLM context limits once many videos are combined
  • Solution: Progressive compression through successive summarization stages
  • Result: 15-20 videos fit in a single 128K-token context window
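
A back-of-the-envelope check makes the context problem concrete. The sketch below uses the approximate per-video sizes listed under Output Structure and assumes about 4 characters per token, which is only a rough rule of thumb (JSON often tokenizes less efficiently). The function names and the reserved-token margin are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: rough average; real tokenizers vary

# Approximate per-video sizes in KB, taken from the output directory listing.
TIERS_KB = {
    "final_metadata": 35,
    "compressed_metadata": 7,
    "llm_summaries": 2,
    "digest_metadata": 1,
}

def estimate_tokens(size_kb):
    """Estimate the token count of size_kb kilobytes of text."""
    return int(size_kb * 1024 / CHARS_PER_TOKEN)

def fits_in_context(per_video_kb, n_videos,
                    context_tokens=128_000, reserve_tokens=8_000):
    """Check whether n_videos fit, leaving reserve_tokens for prompt and reply."""
    needed = estimate_tokens(per_video_kb) * n_videos
    return needed <= context_tokens - reserve_tokens

# Which tiers would let 20 videos share one 128K-token window?
report = {tier: fits_in_context(kb, 20) for tier, kb in TIERS_KB.items()}
```

Under these assumptions, 20 videos of full metadata need roughly 179K tokens and do not fit, while each of the compressed tiers does.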

2. Visual-Primary Timeline

  • Traditional NLEs use audio waveforms as primary timeline
  • Our approach: Keyframe semantic embeddings as primary, speech as overlay
  • Enables vision-first sequencing decisions

3. Dual-Stage Roll Classification

  • Stage 1: Per-video analysis → Initial A/B-roll classification
  • Stage 2: Cross-video context → Script-aware refinement
  • Result: Dynamic track assignment based on narrative context

4. Zero-Loss Source Preservation

  • Full metadata preserved at Step 4 (source of truth)
  • All compressions (4b-4e) are derived transformations
  • Original timeline always available for reconstruction

Quick Start

# 1. Setup environment (one-time)
.\setup.ps1                    # Windows (PowerShell)

# 2. Activate virtual environment
.\activate.ps1                 # Windows

# 3. Run pipeline
python src/main.py --input ./videos --script ./script.txt

# 4. Import output to DaVinci Resolve
# File → Import → Timeline → .AiEditor/project_timeline.fcpxml

System Requirements

| Component | Minimum | Recommended |
| --- | --- | --- |
| GPU | NVIDIA GTX 1060 (6GB) | RTX 3060+ (8GB+) |
| RAM | 8 GB | 16 GB |
| CUDA | 11.8 | 12.x |
| Python | 3.8 | 3.10+ |
| Disk | 20 GB | 50 GB (for model cache) |

CPU-only mode: Supported with --device cpu (3-5x slower)


Installation

# Clone repository
git clone <repo-url> AiEditor
cd AiEditor

# Run setup script (creates venv, installs dependencies)
.\setup.ps1

# Configure API keys (optional, for cloud LLM)
$env:NVIDIA_API_KEY = "nvapi-xxx"   # For DeepSeek-R1
$env:OPENAI_API_KEY = "sk-xxx"      # For GPT-4o
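
For orientation, the sketch below shows one way a script could turn these environment variables into client settings. It is not taken from the project: the function name, the `deepseek-r1` selector value, the base URLs, and the model identifiers are all assumptions that should be checked against each provider's documentation.

```python
import os

# Assumed endpoints and model identifiers (not confirmed by this README).
PROVIDERS = {
    "deepseek-r1": {
        "env_var": "NVIDIA_API_KEY",
        "base_url": "https://integrate.api.nvidia.com/v1",
        "model": "deepseek-ai/deepseek-r1",
    },
    "gpt-4o": {
        "env_var": "OPENAI_API_KEY",
        "base_url": "https://api.openai.com/v1",
        "model": "gpt-4o",
    },
}

def resolve_llm_config(llm_model="deepseek-r1", env=None):
    """Return client settings for llm_model, or None if its API key is unset."""
    env = os.environ if env is None else env
    if llm_model not in PROVIDERS:
        raise ValueError(f"unknown LLM model: {llm_model}")
    spec = PROVIDERS[llm_model]
    api_key = env.get(spec["env_var"])
    if not api_key:
        return None  # caller can skip the cloud LLM or report a clear error
    return {"api_key": api_key, "base_url": spec["base_url"], "model": spec["model"]}

config = resolve_llm_config("gpt-4o", env={"OPENAI_API_KEY": "sk-xxx"})
```

With the `openai>=1.0` package, the resulting `api_key` and `base_url` can be passed to `openai.OpenAI(api_key=..., base_url=...)`, so the same client code can serve both providers.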

Dependencies

Core packages installed via requirements.txt:

  • torch>=2.0 (CUDA 11.8+ wheels)
  • transformers>=4.40 (Qwen2-VL support)
  • openai-whisper (ASR)
  • opencv-python (video I/O)
  • openai>=1.0 (unified LLM client)
  • rich (CLI interface)

Full installation guide: INSTALLATION.md


Usage

Basic Commands

# Full pipeline with script guidance
python src/main.py --input ./videos --script ./script.txt

# Auto-sequence without script
python src/main.py --input ./videos

# Specific steps only
python src/main.py --input ./videos --step 2    # Transcription only
python src/main.py --input ./videos --step 3    # Visual analysis only
python src/main.py --input ./videos --step 5b   # Regenerate XML only

Advanced Options

# Whisper model selection (base/small/medium/large)
python src/main.py --input ./videos --whisper-model medium

# Force language (en/hi/te/auto)
python src/main.py --input ./videos --language en

# GPU selection
$env:CUDA_VISIBLE_DEVICES = "0"
python src/main.py --input ./videos

# Alternative LLM provider
python src/main.py --input ./videos --llm-model gpt-4o

# Verbose logging
python src/main.py --input ./videos --verbose
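
The flags used in this README can be summarized as an `argparse` definition. This is a hypothetical reconstruction for reference only, not the parser in `src/main.py`; the default values in particular are assumptions.

```python
import argparse

def build_parser():
    """Build a parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--input", required=True, help="folder of raw videos")
    parser.add_argument("--script", help="optional script file to guide sequencing")
    parser.add_argument("--step", help="run a single step, e.g. 2, 3 or 5b")
    parser.add_argument("--whisper-model", default="base",  # default is assumed
                        choices=["base", "small", "medium", "large"])
    parser.add_argument("--language", default="auto",       # default is assumed
                        choices=["en", "hi", "te", "auto"])
    parser.add_argument("--llm-model", help="alternative LLM, e.g. gpt-4o")
    parser.add_argument("--device", help="e.g. cpu for CPU-only mode")
    parser.add_argument("--search", help="natural language search query")
    parser.add_argument("--search-type", help="e.g. transcript or visual")
    parser.add_argument("--verbose", action="store_true", help="verbose logging")
    return parser

args = build_parser().parse_args(
    ["--input", "./videos", "--step", "5b", "--whisper-model", "medium"]
)
```

Note that `--step` is kept as a string because step names such as `5b` are not integers.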

Output Structure

input_folder/
├── .AiEditor/                              # Pipeline outputs
│   ├── step1_summary.json                  # Ingestion metadata
│   ├── step2_summary.json                  # Transcription summary
│   ├── step3_summary.json                  # Visual analysis summary
│   ├── audio/                              # Extracted PCM audio
│   │   └── {video_id}.wav
│   ├── transcriptions/                     # Whisper outputs
│   │   └── {video_id}_transcription.json   # Word-level timestamps
│   ├── keyframes/                          # Extracted frames
│   │   └── {video_id}/                     # 1 folder per video
│   ├── visual_tags/                        # VLM analysis
│   │   └── {video_id}_tags.json            # Scene descriptions
│   ├── final_metadata/                     # Full timeline (~35KB/video)
│   ├── compressed_metadata/                # Lossless compression (~7KB)
│   ├── llm_summaries/                      # DeepSeek summaries (~2KB)
│   ├── digest_metadata/                    # Ultra-compressed (~1KB)
│   ├── edl/                                # Edit decisions
│   │   └── edit_decision_list.json
│   └── project_timeline.fcpxml             # ← IMPORT THIS
└── [original videos unchanged]
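
To see how much each compression tier saves on your own footage, the tier folders can simply be measured. The sketch below uses the directory names from the tree above; the function names are hypothetical, and the default choice of which two tiers to compare is an assumption.

```python
from pathlib import Path

# Metadata tiers, from least to most compressed (names from the tree above).
TIERS = ["final_metadata", "compressed_metadata", "llm_summaries", "digest_metadata"]

def dir_size_bytes(path):
    """Total size of all files under path, or 0 if the folder is missing."""
    folder = Path(path)
    if not folder.is_dir():
        return 0
    return sum(f.stat().st_size for f in folder.rglob("*") if f.is_file())

def tier_sizes(output_dir):
    """Map each tier name to its total size in bytes."""
    return {tier: dir_size_bytes(Path(output_dir) / tier) for tier in TIERS}

def compression_ratio(sizes, source="final_metadata", target="digest_metadata"):
    """Size of the source tier divided by the target tier, or None if empty."""
    if sizes[target] == 0:
        return None
    return sizes[source] / sizes[target]

# Example usage (path is illustrative):
#   sizes = tier_sizes("./videos/.AiEditor")
#   print(sizes, compression_ratio(sizes))
```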

Performance

Tested on: 19 videos (343 seconds of footage in total), RTX 3060 12GB

| Step | Time | Notes |
| --- | --- | --- |
| 1-2 | 2-3 min | Audio + transcription |
| 3 | 8-12 min | Visual analysis (bottleneck) |
| 4-5b | ~1 min | Compression + LLM + export |
| Total | ~15 min | End-to-end |

Compression: 785 KB → 23 KB (97% reduction, 33:1 ratio)
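
The relationship between the two quoted figures can be checked directly from the rounded sizes:

```python
original_kb = 785
compressed_kb = 23

ratio = original_kb / compressed_kb                        # how many times smaller
reduction_pct = (1 - compressed_kb / original_kb) * 100    # share of bytes removed

print(f"ratio ~ {ratio:.1f}:1, reduction ~ {reduction_pct:.1f}%")
```

Using these rounded kilobyte values, the ratio comes out at about 34:1 and the reduction at about 97%. The small gap from the quoted 33:1 is presumably due to rounding in the reported sizes.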


Documentation

| Document | Description |
| --- | --- |
| INSTALLATION.md | Setup instructions |
| PIPELINE.md | Pipeline documentation |
| ARCHITECTURE.md | System diagrams |
| USAGE.md | CLI reference |

Limitations

  • GPU: Qwen2-VL requires at least 6 GB of VRAM for GPU inference
  • LLM: A cloud API is needed for the best sequencing results
  • Audio: Low-quality audio degrades ASR accuracy

Future Work

  • B-roll matching via CLIP similarity
  • Background music with auto-ducking
  • Multi-track output (V1, V2, A1, A2)

License

MIT License - See LICENSE


IIT Ropar | End Semester Project | January 2026

About

Multi-modal AI video editing pipeline. Whisper ASR + Qwen2-VL vision + DeepSeek-R1 reasoning → FCPXML timelines for DaVinci Resolve. Features 33:1 metadata compression, semantic vector search, and A-roll/B-roll classification. IIT Ropar Module E project.
