- Using PySceneDetect to split each video into scenes
- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (I don't have the exact rate handy right now, but it was 1-2 frames per scene)
- Aggregating frames into batches of around 256 to be normalized for CLIP embedding on the GPU (I had to rewrite the normalization step for this because the default library does it on the CPU)
- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB (in my case Qdrant running locally), plus a screencap of the frame itself for debugging
I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) I might pick it up again in a few weeks. I'd like to have temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.