- Using PySceneDetect to split each video into scenes
- Using the decord library https://github.com/dmlc/decord to pull frames from each scene at a particular sample rate (I don't have the exact rate handy right now, but it was 1-2 frames per scene)
- Aggregating frames into batches of around 256 to be normalized for CLIP embedding on the GPU (I had to rewrite the normalization step for this because the default library does it on the CPU)
- Uploading the frame embeddings along with metadata (timestamp, etc.) into a vector DB (in my case Qdrant running locally), plus a screencap of the frame itself for debugging
I'm bottlenecked by GPU compute, so I also started experimenting with Modal for the embedding work, but then vacation ended :) I might pick it up again in a few weeks. I'd like to have temporal-aware and potentially enriched search, so that I can say "Seek to the scene in Oppenheimer where Rami Malek testifies" and get back a timestamped clip from the movie.