I’m looking for something that I could run locally and turn loose on a collection of videos to get a quick list of tags for each piece of content.

For example, a video of a cat playing in a front yard on a sunny day would generate tags like ["Cat", "Grass", "Sidewalk", "Sunny", "Flower", "Dirt"], and a video of children playing on a playground would generate ["Child", "Slide", "Swing", "Seesaw", "Kids"].

There seem to be a number of online products that will do this sort of thing for YouTube videos, or that let you upload content to their cloud for analysis (often for a decent price). I don’t want to run everything through the internet, though, as I’d probably spend more time uploading than the results would be worth.

It seems like OpenCV might be capable of something like this, but I haven’t found anyone describing how to use it without first training your own model. That would probably defeat much of the point, as I’d have to go tag all my own content first to teach the model how to do it. Is that right?
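
To make the goal concrete, here is a rough sketch of the last step I have in mind: collapsing per-frame labels into one tag list per video. It assumes some pretrained classifier has already produced a list of labels for each sampled frame; that part is exactly what I’m missing, so the `sampled` data below is made up, and `aggregate_tags`, `min_fraction`, and `max_tags` are just names I picked for illustration.

```python
from collections import Counter
from typing import Iterable, List


def aggregate_tags(frame_labels: Iterable[Iterable[str]],
                   min_fraction: float = 0.2,
                   max_tags: int = 10) -> List[str]:
    """Collapse per-frame labels into one tag list for a whole video.

    A label is kept only if it appears in at least `min_fraction`
    of the sampled frames, which filters out one-off misdetections.
    """
    counts = Counter()
    total = 0
    for labels in frame_labels:
        total += 1
        counts.update(set(labels))  # count each label at most once per frame
    if total == 0:
        return []
    kept = [(label, n) for label, n in counts.items()
            if n / total >= min_fraction]
    # Most frequent first; ties broken alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [label for label, _ in kept[:max_tags]]


# Made-up labels for five sampled frames of the cat video
# (stand-in for whatever a pretrained model would return).
sampled = [
    ["Cat", "Grass"],
    ["Cat", "Grass", "Flower"],
    ["Cat", "Sidewalk"],
    ["Cat", "Grass", "Bird"],
    ["Cat", "Grass"],
]

print(aggregate_tags(sampled, min_fraction=0.4))  # → ['Cat', 'Grass']
```

Raising `min_fraction` trades recall for fewer spurious tags; I have no idea yet what a sensible value would be for real footage.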