Metadata has always been an essential ingredient in helping customers find something to watch. It ranges from the high-level descriptions in the TV guide to the extended descriptions of recommended items.
Today, machine-learning-driven content discovery experiences and clip-based navigation capabilities rely on the availability of descriptive, semantically meaningful metadata.
Unfortunately, the availability of such metadata is limited by the high cost of creating it, which requires a significant amount of human supervision. We show how automatic content analysis combines video, audio, and text processing with machine learning algorithms to identify relevant moments, or temporal segments, and their descriptions with little or no human interaction.
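To make the idea concrete, here is a minimal sketch of one common way to combine per-modality signals into temporal segments: score each second of content with separate video, audio, and text models, late-fuse the scores, and extract runs above a threshold. The function names, fusion weights, and threshold below are hypothetical placeholders, not the actual system; a real pipeline would use trained models in place of the synthetic scores.

```python
import numpy as np

def fuse_modality_scores(video, audio, text, weights=(0.5, 0.3, 0.2)):
    """Late-fuse per-second relevance scores from three modalities.

    The weights are illustrative; in practice they would be learned
    or tuned on labeled data.
    """
    w_v, w_a, w_t = weights
    return w_v * video + w_a * audio + w_t * text

def extract_segments(scores, threshold=0.6, min_len=3):
    """Return (start, end) indices of runs where the fused score
    stays at or above the threshold for at least min_len steps."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a candidate segment begins
        elif s < threshold and start is not None:
            if i - start >= min_len:       # keep only sufficiently long runs
                segments.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        segments.append((start, len(scores)))
    return segments

# Toy example: 20 seconds of synthetic per-second scores per modality.
rng = np.random.default_rng(0)
video, audio, text = rng.random(20), rng.random(20), rng.random(20)

fused = fuse_modality_scores(video, audio, text)
print(extract_segments(fused))  # detected (start, end) second ranges, if any
```

Each detected (start, end) range marks a candidate moment; a separate captioning or summarization model would then generate its description.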