libdav1d Caching Strategies for Memory Bandwidth

This article explores the caching strategies and memory bandwidth optimizations that the libdav1d AV1 decoder uses to achieve high-performance video playback. We will examine how libdav1d reduces traffic between the CPU and system memory through cache-aligned memory allocation, post-processing filters pipelined within the CPU cache, locality-aware multi-threading, and register-level data reuse in its SIMD routines.

Cache-Aligned Memory Allocation

To avoid the penalty of loads and stores that straddle two cache lines, libdav1d aligns its internal memory buffers. Buffers are allocated on 32-byte or 64-byte boundaries: 64 bytes matches the cache line size of most modern x86 and many ARM processors, while 32 bytes matches the width of an AVX2 vector register.

With aligned base addresses and row strides padded to a multiple of the alignment, each row of pixel data and each internal metadata structure starts on a cache line boundary. An aligned access of vector width or less then never spans two cache lines, so the CPU does not have to fetch two lines to satisfy a single read, which reduces pressure on the memory bus.
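The idea can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not libdav1d's allocator: the `Plane` type, the helper names, and the fixed `CACHE_LINE` of 64 bytes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache line size; 64 bytes is typical, but not universal. */
#define CACHE_LINE 64

/* Hypothetical plane descriptor, not libdav1d's own picture type. */
typedef struct {
    uint8_t *data;   /* first pixel of row 0, cache-line aligned */
    size_t stride;   /* bytes between rows, a multiple of CACHE_LINE */
    size_t height;   /* number of rows */
} Plane;

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t round_up(size_t n, size_t align) {
    return (n + align - 1) & ~(align - 1);
}

/* Allocate a plane in which every row starts on a cache-line boundary.
 * Returns 0 on success, -1 on allocation failure. */
static int plane_alloc(Plane *p, size_t width_bytes, size_t height) {
    assert(width_bytes > 0 && height > 0);
    p->stride = round_up(width_bytes, CACHE_LINE);
    p->height = height;
    /* C11 aligned_alloc expects the size to be a multiple of the alignment;
     * stride * height is, because stride is. */
    p->data = aligned_alloc(CACHE_LINE, p->stride * height);
    return p->data ? 0 : -1;
}

/* Release the plane's pixel memory. */
static void plane_free(Plane *p) {
    free(p->data);
    p->data = NULL;
}
```

Padding the stride costs a little memory per row (a 1000-byte row becomes 1024 bytes here), in exchange for every row starting on a line boundary.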

Pipelined Loop Filtering and Reconstruction

In a straightforward decoder design, a frame is decoded entirely, written back to system RAM, and then read back into the CPU cache to undergo post-processing filters such as deblocking, the Constrained Directional Enhancement Filter (CDEF), and Loop Restoration. Each pixel therefore makes an extra round trip to main memory for no benefit, which is highly inefficient.

libdav1d avoids this bottleneck by employing a pipelined, row-based (or strip-based) reconstruction process. As soon as a row of superblocks is decoded, the post-processing filters are applied while the pixel data is still hot in the CPU’s L1 or L2 cache. Only the final, fully filtered pixels are written back to main memory, which removes the extra write and re-read of unfiltered pixels from the memory traffic.
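The difference between the two approaches can be shown with a toy example. The sketch below is an assumption-laden simplification: `decode_row` and `filter_row` are hypothetical stand-ins for reconstruction and post-filtering, and the filter only looks at pixels within one row. Real AV1 loop filters also read neighbouring rows, so an actual pipeline has to trail the decoder by a few rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define W 64  /* toy frame width in pixels */
#define H 8   /* toy frame height in rows */

/* Stand-in for reconstruction: fills one row with synthetic pixel values. */
static void decode_row(uint8_t *row, size_t y) {
    for (size_t x = 0; x < W; x++)
        row[x] = (uint8_t)((x * 7 + y * 13) & 0xff);
}

/* Stand-in for a post-filter: 3-tap horizontal smoothing, edges clamped. */
static void filter_row(uint8_t *dst, const uint8_t *src) {
    assert(dst != src);  /* in-place use would read already-filtered pixels */
    for (size_t x = 0; x < W; x++) {
        unsigned l = src[x > 0 ? x - 1 : 0];
        unsigned r = src[x + 1 < W ? x + 1 : W - 1];
        dst[x] = (uint8_t)((l + 2u * src[x] + r + 2u) >> 2);
    }
}

/* Two-pass: the whole frame is decoded to memory, then re-read to filter. */
static void decode_two_pass(uint8_t out[H][W]) {
    static uint8_t tmp[H][W];  /* full-frame intermediate buffer */
    for (size_t y = 0; y < H; y++) decode_row(tmp[y], y);
    for (size_t y = 0; y < H; y++) filter_row(out[y], tmp[y]);
}

/* Pipelined: each row is filtered right after it is decoded, from a small
 * scratch buffer that stays cache-resident; only final pixels reach `out`. */
static void decode_pipelined(uint8_t out[H][W]) {
    uint8_t scratch[W];
    for (size_t y = 0; y < H; y++) {
        decode_row(scratch, y);
        filter_row(out[y], scratch);
    }
}

/* Returns 1 if both frames hold identical pixels. */
static int frames_equal(uint8_t a[H][W], uint8_t b[H][W]) {
    return memcmp(a, b, (size_t)H * W) == 0;
}
```

Both functions produce the same pixels. The pipelined version replaces the frame-sized intermediate buffer with a single row of scratch space, which is the property that keeps the working set inside the cache.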

Cache-Aware Multi-Threading

The decoder uses a hybrid threading model that combines frame-level parallelism with tile-level and row-level parallelism. To prevent CPU cores from contending for the same cache lines (an effect often loosely described as cache thrashing), libdav1d schedules its worker threads to maintain high data locality.

Threads working on adjacent parts of a frame are ideally scheduled on CPU cores that share the same L2 or L3 cache. This co-location means that when one thread requires reference pixel data generated by another thread, the data can be retrieved directly from the shared on-chip cache rather than through an expensive fetch from system DRAM.
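Two locality habits that complement this kind of scheduling can be sketched in C: padding per-worker state so that no two workers write to the same cache line, and giving each worker a contiguous band of rows. The `WorkerCtx` type and `assign_rows` helper are hypothetical and are not taken from libdav1d's scheduler; the 64-byte line size is an assumption.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Hypothetical per-worker state. The alignas on the first member raises the
 * struct's alignment to CACHE_LINE, and sizeof is padded to a multiple of
 * it, so neighbouring array elements never share a cache line. */
typedef struct {
    alignas(CACHE_LINE) uint64_t blocks_done;  /* written often by its owner */
    uint32_t first_row;                        /* first row of the band */
    uint32_t end_row;                          /* one past the last row */
} WorkerCtx;

/* Give worker i of n a contiguous band of rows, so that the pixels it
 * touches are adjacent in memory and bands tile the frame without gaps. */
static void assign_rows(WorkerCtx *w, uint32_t i, uint32_t n, uint32_t rows) {
    assert(n > 0 && i < n);
    w->blocks_done = 0;
    w->first_row = (uint32_t)((uint64_t)rows * i / n);
    w->end_row = (uint32_t)((uint64_t)rows * (i + 1) / n);
}
```

With four workers and 34 rows, the bands come out as [0, 8), [8, 17), [17, 25), and [25, 34). Each worker's progress counter sits in its own cache line, so updating it does not invalidate a line that another core is using.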

SIMD Register-Level Caching

At the lowest level, libdav1d is highly optimized with hand-written assembly (using the AVX2, AVX-512, and ARM NEON instruction sets). The assembly routines are structured to maximize register-level reuse. Active coefficients, residual data, and reference pixels stay loaded in CPU vector registers for as long as possible during the inverse transform (for example, the inverse DCT) and prediction phases. This reduces the number of L1 cache reads and writes and frees cache bandwidth for other critical decoding operations.
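The load-once, store-once pattern can be illustrated in portable C. The sketch below uses an unnormalised 4x4 Hadamard transform purely as a simple stand-in; it is not one of libdav1d's assembly kernels, and `hadamard4x4` is a hypothetical name. In C the compiler decides what lives in registers (a small local array like this is usually a good candidate), whereas hand-written assembly makes that placement explicit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unnormalised 4-point Hadamard butterfly on four values, done in place. */
#define HADAMARD4(a, b, c, d) do {              \
    int32_t s0 = (a) + (b), d0 = (a) - (b);     \
    int32_t s1 = (c) + (d), d1 = (c) - (d);     \
    (a) = s0 + s1; (b) = d0 + d1;               \
    (c) = s0 - s1; (d) = d0 - d1;               \
} while (0)

/* 2-D transform of a 4x4 block stored row-major in blk.
 * The block is read from memory once, both passes run on local values,
 * and the result is written back once. */
static void hadamard4x4(int32_t blk[16]) {
    assert(blk != NULL);
    int32_t v[16];

    for (int i = 0; i < 16; i++)  /* single load of the whole block */
        v[i] = blk[i];

    for (int r = 0; r < 4; r++)   /* row pass, no memory round trip */
        HADAMARD4(v[4 * r], v[4 * r + 1], v[4 * r + 2], v[4 * r + 3]);

    for (int c = 0; c < 4; c++)   /* column pass on the same locals */
        HADAMARD4(v[c], v[4 + c], v[8 + c], v[12 + c]);

    for (int i = 0; i < 16; i++)  /* single store of the result */
        blk[i] = v[i];
}
```

A version that wrote the row-pass output to a temporary buffer and read it back for the column pass would compute the same result with an extra store and load per coefficient, which is the traffic that register reuse avoids.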