Many books and articles have been written about artificial intelligence (AI) and machine learning (ML) in a variety of applications. ML is far from new: it has an established theoretical foundation, and many different ML techniques have been introduced over the past decades. These techniques can be classified in different ways, but a full taxonomy is outside the scope of this paper. We focus mostly on ML algorithms, as a subset of AI. Excellent introductions and overviews are provided in, e.g., [Goodfellow16, Bishop95].
Many successes have been claimed for ML, and reports of apparent AI intelligence have already become the subject of ethical discussions [Google22]. Still, machine learning is not always the right solution: in many applications, even though it makes for an interesting marketing statement, it does not lead to net gains or operational savings.
Machine learning has powerful applications in computer vision and in image and video processing, and approaches based on deep neural networks have become the center of academic and industry research. For example, deep convolutional and residual neural networks have shown impressive results for image classification and recognition [Simonyan14, He16]. Still, in most of these cases, very complex models are needed, e.g. deep neural networks containing dozens or hundreds of layers. While it is acceptable to have a very complex training stage (which needs to be executed only once), it is primarily the complexity of the inference network (which must be executed many times) that determines the feasibility of ML approaches. An important unit for expressing the complexity of ML inference networks is the number of multiply-accumulate operations (MACs). Some of the best-performing image recognition networks require hundreds of millions or even billions of MACs per image.
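To make the MAC count concrete, the sketch below counts the multiply-accumulate operations of a single convolutional layer; the layer dimensions are purely illustrative assumptions and are not taken from any specific network cited above.

```python
# Illustrative sketch only: counting MACs for a single 2D convolutional
# layer. The layer dimensions below are assumed example values, not taken
# from any specific network discussed in this paper.

def conv2d_macs(out_h, out_w, out_channels, in_channels, kernel_h, kernel_w):
    # Each output element requires in_channels * kernel_h * kernel_w
    # multiply-accumulate operations.
    return out_h * out_w * out_channels * in_channels * kernel_h * kernel_w

# Example: one 3x3 convolution on a 56x56 feature map, 64 -> 64 channels.
layer_macs = conv2d_macs(56, 56, 64, 64, 3, 3)
print(f"{layer_macs:,} MACs for a single layer")  # 115,605,504
# A deep network stacks dozens of such layers, so per-image totals quickly
# reach hundreds of millions or billions of MACs.
```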
Often, new approaches are deemed feasible when they can be run on state-of-the-art GPUs inside a server. In certain cases this is acceptable, and the cost of a dedicated CPU or GPU is warranted. For real-time, cost-sensitive applications, however, this is not an option. In typical video encoding/transcoding set-ups, dozens or even hundreds of channels need to be processed on a single server, and the cost per channel is a crucial criterion. Furthermore, even if accelerators were cost-effective (which they are not), the latency of offloading decisions to them would be prohibitive.
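As a back-of-envelope illustration of this constraint, the sketch below estimates the per-frame MAC budget when many channels share one server; the channel count, frame rate, and available throughput are assumptions chosen only for illustration, not measurements from this paper.

```python
# Back-of-envelope sketch of the real-time budget per frame. All numbers
# below are assumptions for illustration, not measurements from this paper.

channels = 100               # channels transcoded on one server (assumed)
frame_rate = 30              # frames per second per channel (assumed)
available_gmacs_per_s = 500  # throughput the server can spare for ML (assumed)

frames_per_second = channels * frame_rate
budget_gmacs = available_gmacs_per_s / frames_per_second
print(f"~{budget_gmacs * 1000:.0f} MMACs available per frame")  # ~167
# Under these assumptions, only a small fraction of a GMAC is available per
# frame, far below what large image-recognition networks need per image.
```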
In this paper, we discuss the applicability of machine learning approaches in different areas of real-time video compression. We successively cover encoder complexity reduction, rate control, video quality improvement, and video quality measurement. In each of these areas, we have studied ways to reduce the complexity of ML inference, so that the resulting algorithms remain applicable in real-time, cost-sensitive applications.