libdav1d AV1 Scaling on 64+ Core CPU Servers

This article examines how the libdav1d AV1 decoder scales on high-end servers equipped with 64 or more CPU cores. We analyze libdav1d’s multi-threading architecture, explore why single-stream decoding struggles to utilize massive core counts, and explain how to achieve maximum hardware efficiency through multi-stream workloads and NUMA-aware system configurations.

The Threading Architecture of libdav1d

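To understand how libdav1d performs on high-core-count processors (such as AMD EPYC or Intel Xeon Scalable), it is necessary to look at its two-tier threading model, which combines frame-level parallelism (several frames decoded concurrently) with tile-level parallelism (independent tiles within a frame decoded concurrently).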

This combination allows libdav1d to distribute the decoding workload across dozens of threads. However, the architecture faces inherent limits when scaled to 64 or more physical cores.

The Single-Stream Scaling Ceiling

When decoding a single AV1 video stream, scaling peaks long before reaching 64 cores. In practice, a single 4K or 8K AV1 stream rarely benefits from more than 16 to 32 threads. Several factors cause this bottleneck:

Consequently, attempting to decode a single AV1 stream using all 64+ cores on a high-end server results in low per-core utilization and diminishing performance returns.
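
One way to reason about this ceiling is Amdahl's law, which bounds speedup by the share of work that cannot be parallelized. The sketch below is illustrative only: the 5% serial fraction is an assumed placeholder, not a measured libdav1d figure, and the function names are hypothetical.

```python
def amdahl_speedup(threads: int, serial_fraction: float) -> float:
    """Ideal speedup under Amdahl's law for a given serial fraction."""
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / threads)


def per_thread_efficiency(threads: int, serial_fraction: float) -> float:
    """Fraction of each thread's capacity that turns into useful speedup."""
    return amdahl_speedup(threads, serial_fraction) / threads


if __name__ == "__main__":
    # ASSUMPTION: 5% of single-stream decode work is serial (placeholder value).
    SERIAL_FRACTION = 0.05
    for n in (1, 8, 16, 32, 64, 128):
        s = amdahl_speedup(n, SERIAL_FRACTION)
        e = per_thread_efficiency(n, SERIAL_FRACTION)
        print(f"{n:4d} threads: speedup {s:6.2f}x, efficiency {e:6.1%}")
```

Under that assumption, doubling the thread count from 32 to 64 adds only a small amount of speedup while per-thread efficiency keeps falling, which matches the qualitative behavior described above. Real decoders also pay synchronization and memory costs that this model ignores.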

Achieving Peak Scaling with Multi-Stream Workloads

To fully utilize a server with 64 or more CPU cores, you must transition from single-stream decoding to parallel, multi-stream decoding.

In high-density media pipelines, streaming platforms, or CDN edge servers, running multiple independent instances of libdav1d lets aggregate throughput scale almost linearly with the number of instances. Because there are no dependencies between different video streams, 64 or more instances of libdav1d can run concurrently without thread synchronization bottlenecks between them. This approach allows high-end servers to achieve near 100% CPU utilization and maximum aggregate decoding throughput (measured in total frames per second).
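
As a rough sketch of such a deployment, the snippet below partitions a machine's cores into disjoint groups and builds one decoder command line per stream. It only constructs the commands and does not run them. Several things here are assumptions: that core IDs are contiguous from 0, that the `dav1d` command-line tool and `taskset` are installed, and that the installed `dav1d` build accepts `--threads` and `--muxer null`. The helper names and file names are hypothetical.

```python
def split_cores(total_cores: int, threads_per_stream: int) -> list[list[int]]:
    """Partition core IDs 0..total_cores-1 into disjoint, equally sized groups.

    Cores left over when total_cores is not a multiple of threads_per_stream
    are simply left unassigned.
    """
    if total_cores < 1 or threads_per_stream < 1:
        raise ValueError("core and thread counts must be >= 1")
    groups = []
    last_start = total_cores - threads_per_stream
    for start in range(0, last_start + 1, threads_per_stream):
        groups.append(list(range(start, start + threads_per_stream)))
    return groups


def decoder_command(input_path: str, cores: list[int]) -> list[str]:
    """Build a pinned decode command for one stream (not executed here)."""
    cpu_list = ",".join(str(c) for c in cores)
    return [
        "taskset", "-c", cpu_list,       # pin the process to its core group
        "dav1d", "-i", input_path,
        "-o", "/dev/null",
        "--muxer", "null",               # ASSUMPTION: discard decoded frames
        "--threads", str(len(cores)),    # ASSUMPTION: flag name in this build
    ]


if __name__ == "__main__":
    # Hypothetical workload: 16 streams of 4 threads each on a 64-core host.
    groups = split_cores(total_cores=64, threads_per_stream=4)
    for i, cores in enumerate(groups):
        print(" ".join(decoder_command(f"stream_{i:02d}.ivf", cores)))
```

Each resulting command could be handed to `subprocess.Popen` to start the instances concurrently. Keeping the groups disjoint means no two streams compete for the same core.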

Managing NUMA and Memory Bottlenecks

High-end servers with 64 or more cores are typically designed around Non-Uniform Memory Access (NUMA) architectures. These systems group cores and memory into distinct NUMA nodes. If libdav1d is not configured with NUMA in mind, performance can degrade due to cross-node memory latency.
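
To make the node grouping concrete, the following sketch reads the CPU list of each NUMA node from Linux sysfs and wraps a decoder command in `numactl` so that both its threads and its memory stay on one node. It assumes a Linux host that exposes `/sys/devices/system/node/nodeN/cpulist` and has `numactl` installed. The helper names are hypothetical, and the parser handles only plain `a-b` ranges and single IDs.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8-11' into CPU IDs."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_node_cpus(node: int, sysfs_root: str = "/sys/devices/system/node") -> list[int]:
    """Return the CPU IDs that belong to one NUMA node (Linux only)."""
    return parse_cpulist(Path(sysfs_root, f"node{node}", "cpulist").read_text())


def assign_streams(stream_count: int, node_count: int) -> dict[int, list[int]]:
    """Spread stream indices across NUMA nodes round-robin."""
    if node_count < 1:
        raise ValueError("node_count must be >= 1")
    plan: dict[int, list[int]] = {node: [] for node in range(node_count)}
    for stream in range(stream_count):
        plan[stream % node_count].append(stream)
    return plan


def numa_command(node: int, decoder_cmd: list[str]) -> list[str]:
    """Prefix a command so its CPUs and memory are bound to the same node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", *decoder_cmd]


if __name__ == "__main__":
    # Hypothetical layout: 8 streams spread over 2 NUMA nodes.
    for node, streams in assign_streams(8, 2).items():
        for s in streams:
            print(" ".join(numa_command(node, ["dav1d", "-i", f"stream_{s}.ivf"])))
```

Binding memory as well as CPUs is the point of the design: pinning threads alone does not prevent frame buffers from being allocated on a remote node.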

To optimize libdav1d on these platforms, implement the following deployment practices: