How libdav1d Optimizes AV1 Loop Filtering

This article analyzes how the open-source AV1 decoder, libdav1d, optimizes the computationally intensive in-loop filtering stages of the AV1 video codec. We explore the library's use of hand-written SIMD assembly, cache-friendly pipelining, and multi-threaded scheduling to accelerate the deblocking filter, the Constrained Directional Enhancement Filter (CDEF), and Loop Restoration. Together these techniques help make it one of the fastest software AV1 decoders available.

The Challenge of AV1 In-Loop Filtering

AV1 applies three loop filters, in order, to each reconstructed frame: the Deblocking Filter (DBF), the Constrained Directional Enhancement Filter (CDEF), and the Loop Restoration (LR) filter. Because the filtered frame is also the reference for later frames, the decoder cannot skip or approximate these stages. They significantly improve visual quality and compression efficiency, but they are expensive and can account for a substantial share of total decoding time. libdav1d tackles this cost through hardware-specific and architectural optimizations.

Hand-Written SIMD Vectorization

The primary speedup in libdav1d comes from extensive, hand-written assembly code utilizing Single Instruction, Multiple Data (SIMD) instruction sets.

x86 Optimization (AVX2 and AVX-512): Vectorized kernels parallelize the per-pixel calculations of the CDEF direction search and directional filtering over 8x8 blocks, as well as the Wiener filter (7-tap for luma) and the Self-Guided Restoration (SGR) filter used in Loop Restoration.
ARM Optimization (NEON): Customized NEON assembly ensures efficient execution on mobile devices and single-board computers.

By writing directly in assembly rather than relying on compiler auto-vectorization, libdav1d developers control register allocation, schedule instructions to hide latency, and use platform-specific instructions that a compiler may not select on its own.
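
To make the data-parallel structure concrete, here is a minimal scalar sketch of the CDEF constrain step as described in the AV1 specification, applied across one row of pixels. It is an illustration in Python, not libdav1d's code; the helper constrain_row and the sample pixel values are hypothetical. Every element goes through the same short sequence of arithmetic, which is what allows a SIMD kernel to process a whole vector of pixels per instruction.

```python
def constrain(diff, threshold, damping):
    """Limit a pixel difference so that large differences (likely real edges)
    contribute less to the filter than small ones (likely coding noise)."""
    if threshold == 0:
        return 0
    msb = threshold.bit_length() - 1      # position of the highest set bit
    shift = max(0, damping - msb)
    mag = abs(diff)
    limited = min(mag, max(0, threshold - (mag >> shift)))
    return limited if diff >= 0 else -limited


def constrain_row(center, neighbors, threshold, damping):
    """Apply constrain() to every lane of a row (hypothetical helper).
    A SIMD kernel would evaluate all of these lanes at once."""
    return [constrain(n - c, threshold, damping)
            for c, n in zip(center, neighbors)]


# Illustrative sample data: eight pixels and their neighbors along one direction.
center = [100, 100, 100, 100, 100, 100, 100, 100]
neighbors = [101, 103, 105, 110, 97, 95, 100, 140]
print(constrain_row(center, neighbors, threshold=4, damping=3))
```

Small differences pass through unchanged, while differences well above the threshold are driven to zero, so strong edges are left alone.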

Pipelining and Cache Locality

Memory bandwidth is often a greater bottleneck than arithmetic throughput. To mitigate this, libdav1d uses a pipelined design that applies the loop filters as soon as possible after a block is reconstructed.

Instead of decoding an entire frame and then running each of the three filters over the whole frame in turn, libdav1d applies the filters to smaller units (usually row-by-row or superblock-by-superblock). The pixels a filter needs are then likely to still be in the CPU's L1/L2 cache from the previous stage, which avoids expensive round-trips to main memory (RAM).
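
The difference between the two orderings can be sketched with a toy model. The code below is not libdav1d's scheduler; the stage names, the one-row lag between stages, and the "reuse distance" metric (how many operations pass between two consecutive touches of the same row) are all assumptions chosen for illustration. A short reuse distance means the row is more likely to still be in cache when the next stage reads it.

```python
STAGES = ("recon", "deblock", "cdef", "restore")


def frame_at_a_time(num_rows):
    """Run each stage over the whole frame before starting the next stage."""
    return [(stage, row) for stage in STAGES for row in range(num_rows)]


def row_pipelined(num_rows):
    """Run each stage one row behind the previous stage (assumed lag)."""
    order = []
    for step in range(num_rows + len(STAGES) - 1):
        for lag, stage in enumerate(STAGES):
            row = step - lag
            if 0 <= row < num_rows:
                order.append((stage, row))
    return order


def max_reuse_distance(order):
    """Largest gap, in operations, between consecutive touches of one row."""
    last_seen = {}
    worst = 0
    for index, (_, row) in enumerate(order):
        if row in last_seen:
            worst = max(worst, index - last_seen[row])
        last_seen[row] = index
    return worst


rows = 34  # e.g. a 2160-pixel-tall frame split into 64-pixel superblock rows
print(max_reuse_distance(frame_at_a_time(rows)))
print(max_reuse_distance(row_pipelined(rows)))
```

For this 34-row example the frame-at-a-time order has a reuse distance of 34, which grows with frame height, while the pipelined order stays at 5 regardless of frame size.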

Row-Based Multi-Threading

libdav1d's multi-threading model schedules decoding and filtering tasks dynamically, tracking per-row progress so that work advances through the frame as a wavefront.

As soon as a row of superblocks finishes reconstruction, deblocking can begin on it. Once deblocking has completed for a region, CDEF is scheduled, followed by Loop Restoration. Each stage has to trail the previous one slightly, because a filter reads pixels just outside the area it writes, and those pixels must already hold the previous stage's final output. This fine-grained dependency tracking lets different cores work on different stages of the decoding pipeline at the same time, which keeps CPU utilization high on multi-core systems.
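
The dependency rule can be sketched as a small deterministic simulation. This is a simplified model, not libdav1d's task scheduler: it assumes one task per stage per superblock row, that every task takes one time unit, and that a stage may process row r only after the previous stage has finished rows r and r+1. The names ready_tasks and simulate are hypothetical.

```python
STAGES = ("recon", "deblock", "cdef", "restore")


def ready_tasks(done, num_rows):
    """Return (stage_index, row) pairs that may run now.
    done[i] is the number of rows stage i has already finished."""
    tasks = []
    for i in range(len(STAGES)):
        row = done[i]
        if row >= num_rows:
            continue  # this stage has finished the frame
        if i == 0:
            tasks.append((i, row))  # reconstruction has no upstream stage
            continue
        # Assumed rule: the previous stage must have finished this row and
        # the row below it (or the whole frame, for the last row).
        needed = min(row + 2, num_rows)
        if done[i - 1] >= needed:
            tasks.append((i, row))
    return tasks


def simulate(num_rows, workers):
    """Count time units needed when up to `workers` tasks run per unit."""
    done = [0] * len(STAGES)
    ticks = 0
    while done[-1] < num_rows:
        for stage_index, _ in ready_tasks(done, num_rows)[:workers]:
            done[stage_index] += 1
        ticks += 1
    return ticks


print(simulate(34, workers=1))
print(simulate(34, workers=4))
```

Under these assumptions a 34-row frame takes 136 time units on one worker and 40 on four: after a short ramp-up, all four stages run concurrently on different rows.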

Algorithmic Simplifications

Beyond hardware-level optimizations, libdav1d avoids redundant calculations. For instance, the decoder fast-paths blocks that do not require filtering, such as skipped blocks or blocks with zero filter strength. By identifying these areas early and bypassing the filter kernels for them, the decoder saves CPU cycles on static or low-detail content, which further helps real-time playback.
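
A minimal sketch of this kind of early-out, assuming a simplified per-block record: the Block type and both helper functions are hypothetical, and the real conditions in a decoder also depend on neighboring blocks and frame-level settings.

```python
from dataclasses import dataclass


@dataclass
class Block:
    filter_level: int  # deblocking strength for this block; 0 disables it
    skip: bool         # True when the block carries no coded residual


def deblock_candidates(blocks):
    """Indices of blocks whose edges need any deblocking work."""
    return [i for i, b in enumerate(blocks) if b.filter_level > 0]


def cdef_candidates(blocks):
    """Indices of blocks CDEF has to visit; skipped blocks are bypassed."""
    return [i for i, b in enumerate(blocks) if not b.skip]


# Illustrative sample data, not taken from a real bitstream.
blocks = [
    Block(filter_level=0, skip=True),
    Block(filter_level=12, skip=False),
    Block(filter_level=0, skip=False),
    Block(filter_level=7, skip=True),
]
print(deblock_candidates(blocks))
print(cdef_candidates(blocks))
```

The check is a cheap comparison per block, so it costs far less than running a filter kernel whose output would leave the pixels unchanged.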