Multi-modal Sports Video Summarization

Built a multi-modal pipeline that uses BERT and CNNs to detect and align key football moments from match commentary and scoreboard footage. The final output is a stitched highlight video.
- Fine-tuned BERT for audio commentary classification.
- Used image processing on the scoreboard region to extract goals and other match events.
- Applied weighted timestamp fusion across both modalities to select the segments stitched into the final highlight video.
- Built an interactive demo UI with Streamlit.
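Scoreboard event extraction can be illustrated with a small sketch of one plausible approach: detecting goals as changes in the displayed score over time. It assumes the scoreboard region has already been cropped and converted to text such as `"2-1"` per sampled frame (for example by OCR, which is not shown), so the input is a hypothetical list of `(timestamp, text)` readings. The names `parse_score`, `detect_goal_times` and the `min_stable` parameter are illustrative, not taken from the project.

```python
import re
from typing import List, Optional, Tuple

# Assumed format: the cropped score region reads like "2-1" (home-away).
SCORE_RE = re.compile(r"(\d+)\s*-\s*(\d+)")


def parse_score(text: str) -> Optional[Tuple[int, int]]:
    """Extract (home, away) from a scoreboard string, or None if unreadable."""
    m = SCORE_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def detect_goal_times(readings: List[Tuple[float, str]], min_stable: int = 2) -> List[float]:
    """Return timestamps where the score changes and the new value persists
    for at least `min_stable` consecutive readings (debounces OCR flicker)."""
    goals: List[float] = []
    current = None          # last confirmed score
    candidate = None        # new score waiting for confirmation
    candidate_start = 0.0   # time the candidate score first appeared
    count = 0
    for t, text in readings:
        score = parse_score(text)
        if score is None:
            continue  # skip frames where the score could not be read
        if current is None:
            current = score  # first readable frame sets the baseline
            continue
        if score == current:
            candidate, count = None, 0  # a transient misread reverted; discard it
            continue
        if score == candidate:
            count += 1
        else:
            candidate, candidate_start, count = score, t, 1
        if count >= min_stable:
            goals.append(candidate_start)
            current, candidate, count = candidate, None, 0
    return goals
```

With `min_stable=2`, a single misread frame (say `"7-0"` flashing once) is ignored, while a genuine score change is reported at the time it first appeared.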
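Weighted timestamp fusion can be sketched as follows, under stated assumptions: each modality emits events as `(time, confidence, source)`, events closer than a `tolerance` are grouped as one moment, and each group's score is a weighted sum of the best confidence per source. The modality weights, `tolerance`, `threshold`, and clip padding (`pre`, `post`) below are hypothetical placeholders, since the project's actual values are not stated.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    time: float        # seconds from the start of the video
    confidence: float  # detector/classifier score in [0, 1]
    source: str        # e.g. "commentary" or "scoreboard"


# Hypothetical modality weights; not taken from the original project.
WEIGHTS: Dict[str, float] = {"commentary": 0.4, "scoreboard": 0.6}


def _score_group(group: List[Event], weights: Dict[str, float]) -> Tuple[float, float]:
    # Keep only the best confidence per source so repeated hits from one
    # modality are not double-counted.
    best: Dict[str, float] = {}
    for ev in group:
        best[ev.source] = max(best.get(ev.source, 0.0), ev.confidence)
    score = sum(weights.get(src, 0.0) * conf for src, conf in best.items())
    # Anchor the moment at the mean event time, weighted by modality weight * confidence.
    w = [weights.get(ev.source, 0.0) * ev.confidence for ev in group]
    total = sum(w)
    time = sum(ev.time * wi for ev, wi in zip(group, w)) / total if total else group[0].time
    return time, score


def fuse_events(events: List[Event], weights: Dict[str, float] = WEIGHTS,
                tolerance: float = 10.0, threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Group time-adjacent events and return (time, score) for groups at or above threshold."""
    fused: List[Tuple[float, float]] = []
    group: List[Event] = []
    for ev in sorted(events, key=lambda e: e.time):
        if group and ev.time - group[-1].time > tolerance:
            fused.append(_score_group(group, weights))
            group = []
        group.append(ev)
    if group:
        fused.append(_score_group(group, weights))
    return [(t, s) for t, s in fused if s >= threshold]


def to_clips(fused: List[Tuple[float, float]], pre: float = 15.0, post: float = 10.0) -> List[List[float]]:
    """Turn fused moments into [start, end] windows, merging any that overlap."""
    clips: List[List[float]] = []
    for t, _ in sorted(fused):
        start, end = max(0.0, t - pre), t + post
        if clips and start <= clips[-1][1]:
            clips[-1][1] = max(clips[-1][1], end)
        else:
            clips.append([start, end])
    return clips
```

A moment confirmed by both modalities outscores one seen by commentary alone, so a lone low-confidence commentary spike can fall below `threshold` and be dropped. Grouping here chains consecutive events, so a long run of closely spaced detections collapses into one moment; the resulting `[start, end]` windows are what a stitching step would cut from the source video.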
Here’s the architecture of the system: