libdav1d AV1 Scaling on 64+ Core CPU Servers
This article examines how the libdav1d AV1 decoder scales on high-end servers equipped with 64 or more CPU cores. We analyze libdav1d’s multi-threading architecture, explore why single-stream decoding struggles to utilize massive core counts, and explain how to achieve maximum hardware efficiency through multi-stream workloads and NUMA-aware system configurations.
The Threading Architecture of libdav1d
To understand how libdav1d performs on high-core-count processors (such as AMD EPYC or Intel Xeon Scalable), it is necessary to look at its two-tier threading model:
- Frame Threading: libdav1d can decode multiple video frames in parallel. This is highly efficient but introduces latency and increases memory usage, as several in-flight frames and the reference frames they depend on must be held in memory at the same time.
- Tile/Row Threading: Within a single frame, libdav1d parallelizes the decoding of independent video tiles and rows of block-level data. This reduces latency but is constrained by the way the AV1 video was originally encoded.
This combination allows libdav1d to distribute the decoding workload across dozens of threads. However, the architecture faces inherent limits when scaled to 64 or more physical cores.
The Single-Stream Scaling Ceiling
When decoding a single AV1 video stream, scaling peaks long before reaching 64 cores. In practice, a single 4K or 8K AV1 stream rarely benefits from more than 16 to 32 threads. Several factors cause this bottleneck:
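- Serial Dependencies: AV1 decoding involves inherently sequential steps. Entropy decoding uses a multi-symbol adaptive arithmetic coder (not the binary CABAC scheme of H.264/HEVC), and its symbols must be decoded in order within each tile. In-loop filtering, in turn, depends on reconstructed neighboring blocks. These steps cannot be easily parallelized and create serialization bottlenecks.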
- Tile Limitations: Tile-threading relies on the video file containing multiple tiles. If an AV1 stream was encoded with only 4 tiles, libdav1d cannot efficiently distribute the spatial decoding workload across 64 cores.
- Threading Overhead: As the thread count grows past roughly 32, the CPU cycles spent on thread synchronization, context switching, and cache-coherency traffic can begin to outweigh the computational benefit of the additional parallelism.
Consequently, attempting to decode a single AV1 stream using all 64+ cores on a high-end server results in low per-core utilization and diminishing performance returns.
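The diminishing returns described above can be illustrated with Amdahl's law. The sketch below is a simplified model, not a measurement of libdav1d: the 95% parallel fraction is an assumed value chosen for illustration, and the model ignores the synchronization overhead that makes real scaling worse than this ideal.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # ASSUMPTION: 95% of the decode work is parallelizable. This number is
    # illustrative only and is not derived from libdav1d profiling.
    assumed_parallel_fraction = 0.95
    for threads in (1, 8, 16, 32, 64, 128):
        speedup = amdahl_speedup(assumed_parallel_fraction, threads)
        efficiency = speedup / threads
        print(f"{threads:>3} threads: {speedup:5.2f}x speedup, "
              f"{efficiency:6.1%} per-thread efficiency")
```

Under this assumed fraction, doubling from 32 to 64 threads raises the ideal speedup only from about 12.5x to about 15.4x, and no thread count can exceed 20x. A real decoder also pays synchronization costs, so measured curves flatten even earlier.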
Achieving Peak Scaling with Multi-Stream Workloads
To fully utilize a server with 64 or more CPU cores, you must transition from single-stream decoding to parallel, multi-stream decoding.
In high-density media pipelines, streaming platforms, or CDN edge servers, running multiple independent instances of libdav1d scales almost linearly. Because there are no dependencies between different video streams, 64 or more instances of libdav1d can run concurrently without thread synchronization bottlenecks. This approach allows high-end servers to achieve near 100% CPU utilization and maximum aggregate decoding throughput (measured in total frames per second).
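As a rough sketch of the multi-stream approach, the helper below works out how many decoder processes fit on a machine for a given per-stream thread count and builds one `dav1d` command line per stream. The file names, the 64-core total, and the 4-threads-per-stream split are hypothetical. The `-i`, `-o`, and `--threads` options are assumed to match the dav1d 1.x command-line tool; check `dav1d --help` on the installed version before relying on them.

```python
import shlex
from typing import List


def streams_that_fit(total_cores: int, threads_per_stream: int) -> int:
    """Number of concurrent decoder processes that avoids oversubscribing cores."""
    if total_cores < 1 or threads_per_stream < 1:
        raise ValueError("core and thread counts must be positive")
    return max(1, total_cores // threads_per_stream)


def build_dav1d_command(input_path: str, threads: int) -> List[str]:
    """Command line for one decoder process (flags assumed from dav1d 1.x)."""
    # Decoded output is discarded so that only decode throughput is exercised.
    return ["dav1d", "-i", input_path, "-o", "/dev/null",
            "--threads", str(threads)]


if __name__ == "__main__":
    # Hypothetical inputs: 16 independent streams on a 64-core server.
    inputs = [f"stream_{i:02d}.ivf" for i in range(16)]
    threads_per_stream = 4
    concurrent = streams_that_fit(64, threads_per_stream)

    # Dry run: print the commands instead of launching them. To execute,
    # start each one with subprocess.Popen(command) and wait on all of them.
    for path in inputs[:concurrent]:
        print(shlex.join(build_dav1d_command(path, threads_per_stream)))
```

Keeping each process at a small thread count and running many of them side by side trades single-stream latency for aggregate throughput, which is the trade-off this section describes.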
Managing NUMA and Memory Bottlenecks
High-end servers with 64 or more cores are typically designed around Non-Uniform Memory Access (NUMA) architectures. These systems group cores and memory into distinct NUMA nodes. If libdav1d is not configured with NUMA in mind, performance can degrade due to cross-node memory latency.
To optimize libdav1d on these platforms, implement the following deployment practices:
- Process Pinning: Use tools like numactl (on Linux) to bind specific libdav1d decoding instances to specific NUMA nodes and local memory banks.
- L3 Cache Locality: Keep the thread pool of any single libdav1d instance restricted to a single CPU socket or CCD (Core Complex Die) to minimize cache thrashing.
- Affinity Settings: Avoid letting the operating system scheduler freely move libdav1d threads across all 64+ cores, as this destroys cache locality and increases memory latency.
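The pinning practices above can be sketched as a small planning helper that spreads streams across NUMA nodes round-robin and prefixes each decoder command with `numactl`. The `--cpunodebind` and `--membind` options are standard numactl flags; the node count, file names, and the `dav1d` flags shown are assumptions for illustration, and the real topology should be read from `numactl --hardware` on the target machine.

```python
from typing import Dict, List


def assign_streams_to_nodes(inputs: List[str], numa_nodes: int) -> Dict[int, List[str]]:
    """Distribute input streams across NUMA nodes in round-robin order."""
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be at least 1")
    plan: Dict[int, List[str]] = {node: [] for node in range(numa_nodes)}
    for index, path in enumerate(inputs):
        plan[index % numa_nodes].append(path)
    return plan


def numactl_wrap(node: int, command: List[str]) -> List[str]:
    """Bind a command's CPUs and memory allocations to a single NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + command


if __name__ == "__main__":
    # Hypothetical setup: 8 streams on a server exposing 4 NUMA nodes.
    inputs = [f"stream_{i:02d}.ivf" for i in range(8)]
    for node, paths in assign_streams_to_nodes(inputs, numa_nodes=4).items():
        for path in paths:
            # dav1d flags are assumed from the 1.x CLI; verify with `dav1d --help`.
            decode = ["dav1d", "-i", path, "-o", "/dev/null", "--threads", "4"]
            print(" ".join(numactl_wrap(node, decode)))
```

Binding both CPUs and memory to the same node keeps each decoder's frame buffers local. Restricting an instance to a single CCD would need a finer-grained CPU list (for example via `numactl --physcpubind`), which depends on the specific processor layout.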