Optimizing AV1 Loop Restoration in libdav1d
The AV1 video codec offers superior compression efficiency, but its
post-processing tools, particularly the Loop Restoration (LR) filter,
demand significant computational power. This article explores how
libdav1d, the library behind the popular open-source dav1d AV1 decoder,
implements the Loop Restoration filter efficiently. We will examine the
implementation strategies that help make it one of the fastest
AV1 decoders, focusing on its use of assembly-level SIMD vectorization,
cache-friendly memory management, and intelligent multi-threaded
pipelining.
Understanding the Loop Restoration Bottleneck
Loop Restoration is the final in-loop filtering stage in AV1, applied after deblocking and CDEF. It is designed to remove compression artifacts and restore high-frequency detail. It uses two main algorithms: the Wiener filter and the Self-Guided Restoration (SGR) filter. Because these filters require costly arithmetic, such as multi-tap convolutions and local variance estimates, on every filtered pixel, they can easily bottleneck a decoder if implemented naively.
Advanced SIMD Vectorization
The primary driver of libdav1d’s efficiency is its
extensive use of hand-written assembly language tailored for modern CPU
architectures. Rather than relying on generic C compiler optimizations,
the developers wrote specific implementations for x86 (AVX2, AVX-512,
and SSE) and ARM (NEON) instruction sets.
For the Loop Restoration filter, SIMD (Single Instruction, Multiple
Data) is used to parallelize the horizontal and vertical filtering
passes:
* Wiener Filtering: The 7-tap vertical and horizontal Wiener filters
  are vectorized so that each instruction processes multiple pixel
  values. libdav1d uses coefficient scaling and fixed-point arithmetic
  to avoid slow floating-point operations.
* Self-Guided Restoration: The SGR filter, which involves box-blur and
  variance calculations, is optimized by vectorizing the running box
  sums (pixel sums and sums of squares) and the local variance
  computations.
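The fixed-point idea behind the Wiener pass can be sketched in scalar C. This is a simplified illustration, not libdav1d's code: the function names are hypothetical, the taps are assumed to be 7-bit fixed-point values that sum to 128, and a production decoder would also carry extra intermediate precision between the horizontal and vertical passes. A SIMD kernel performs the same multiply-accumulate on many pixels per instruction.

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_BITS 7 /* assumed precision: the taps sum to 1 << 7 = 128 */

/* Expand three outer coefficients into a symmetric 7-tap kernel. The
 * center tap is derived so that all taps sum to 128, which means a flat
 * region passes through the filter unchanged. */
static void make_symmetric_taps(const int c[3], int taps[7])
{
    taps[0] = taps[6] = c[0];
    taps[1] = taps[5] = c[1];
    taps[2] = taps[4] = c[2];
    taps[3] = (1 << FILTER_BITS) - 2 * (c[0] + c[1] + c[2]);
}

/* One horizontal pass over w pixels using integer arithmetic only.
 * src must have 3 readable pixels on each side of the w pixels. */
static void filter_row_7tap(uint8_t *dst, const uint8_t *src, int w,
                            const int taps[7])
{
    assert(w > 0);
    for (int x = 0; x < w; x++) {
        int sum = 1 << (FILTER_BITS - 1); /* rounding offset (0.5 in Q7) */
        for (int k = 0; k < 7; k++)
            sum += taps[k] * src[x + k - 3];
        if (sum < 0) sum = 0;             /* clamp low before shifting */
        sum >>= FILTER_BITS;              /* back from Q7 to pixel scale */
        dst[x] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

Because the taps are integers scaled by 128, the whole pass needs only multiplies, adds and one shift per pixel, which maps directly onto integer SIMD lanes.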
Cache-Friendly Stripe-Based Processing
Memory bandwidth is often a tighter bottleneck than raw CPU cycles.
To prevent the CPU from constantly fetching data from slow system RAM,
libdav1d processes video frames in localized
“stripes”—typically 64 pixels high.
Instead of applying the Deblocking Filter (DF), the Constrained
Directional Enhancement Filter (CDEF), and Loop Restoration (LR) in
separate, sequential passes over the entire frame, libdav1d
merges these steps into a single pipelined pass. As soon as a stripe of
pixels is decoded and deblocked, it is immediately processed by the CDEF
and LR filters while the pixel data is still residing in the fast L1/L2
CPU cache. This avoids repeated round trips to main memory for the same
pixels.
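The difference between the two loop orders can be sketched as follows. This is a minimal illustration under stated assumptions: the three stage functions are hypothetical per-pixel stand-ins for deblocking, CDEF and LR, and the 64-row stripe height is taken from the text. Real filters read neighboring rows, so an actual pipeline must also let later stages lag behind earlier ones by a few rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIPE_H 64 /* stripe height taken from the text */

typedef void (*stage_fn)(uint8_t *rows, int w, int h, ptrdiff_t stride);

/* Hypothetical per-pixel stand-ins for the three in-loop filters. */
static void stage_halve(uint8_t *p, int w, int h, ptrdiff_t stride)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            p[y * stride + x] = (uint8_t)(p[y * stride + x] / 2);
}

static void stage_offset(uint8_t *p, int w, int h, ptrdiff_t stride)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            p[y * stride + x] = (uint8_t)(p[y * stride + x] + 10);
}

static void stage_invert(uint8_t *p, int w, int h, ptrdiff_t stride)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            p[y * stride + x] = (uint8_t)(255 - p[y * stride + x]);
}

/* Baseline: every stage sweeps the whole frame, so each pixel is pulled
 * from memory once per stage. */
static void filter_frame_sequential(uint8_t *frame, int w, int h,
                                    ptrdiff_t stride,
                                    const stage_fn *stages, int n_stages)
{
    for (int s = 0; s < n_stages; s++)
        stages[s](frame, w, h, stride);
}

/* Fused: all stages run on one stripe before the next stripe is touched,
 * so the stripe can stay in cache between stages. */
static void filter_frame_striped(uint8_t *frame, int w, int h,
                                 ptrdiff_t stride,
                                 const stage_fn *stages, int n_stages)
{
    assert(w > 0 && h > 0);
    for (int y0 = 0; y0 < h; y0 += STRIPE_H) {
        const int sh = h - y0 < STRIPE_H ? h - y0 : STRIPE_H;
        for (int s = 0; s < n_stages; s++)
            stages[s](frame + y0 * stride, w, sh, stride);
    }
}
```

Both functions produce identical output for these stand-in stages; only the order in which memory is visited changes, which is where the cache benefit comes from.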
Line Buffering and Boundary Handling
Both Wiener and SGR filters require access to pixels outside the current processing block (boundary pixels). Normally, this requires copying boundary pixels to a separate temporary buffer, creating overhead.
libdav1d minimizes this overhead through highly
optimized line buffers. It stores only the absolute minimum number of
boundary rows needed for the filter taps. Furthermore, specialized
assembly kernels are written to handle the top and bottom boundaries of
stripes differently, avoiding complex conditional branching inside the
inner loops.
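One way to keep boundary logic out of the inner loop is to resolve it once per output row, by choosing row pointers, and then run a loop that never tests for edges. The sketch below assumes a 7-tap vertical pass, 7-bit taps, and two hypothetical line buffers (`above` and `below`) that each hold three saved rows; none of these names come from libdav1d.

```c
#include <assert.h>
#include <stdint.h>

#define TAPS   7
#define RADIUS 3 /* rows needed above and below for a 7-tap filter */

/* Vertical 7-tap pass for one output row. rows[k] points at the input
 * row (k - RADIUS) lines away. The loop contains no boundary checks. */
static void filter_col_7tap(uint8_t *dst, const uint8_t *const rows[TAPS],
                            int w, const int taps[TAPS])
{
    for (int x = 0; x < w; x++) {
        int sum = 64; /* rounding offset for 7-bit taps */
        for (int k = 0; k < TAPS; k++)
            sum += taps[k] * rows[k][x];
        if (sum < 0) sum = 0;
        sum >>= 7;
        dst[x] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

/* Pick the source of row y for a stripe of height h: rows above the
 * stripe come from the `above` line buffer, rows below it from `below`,
 * and everything else from the stripe itself. */
static const uint8_t *pick_row(const uint8_t *stripe, int h, int stride,
                               const uint8_t *above, const uint8_t *below,
                               int y)
{
    if (y < 0)  return above + (y + RADIUS) * stride;
    if (y >= h) return below + (y - h) * stride;
    return stripe + y * stride;
}

/* Filter a whole stripe; boundary decisions happen once per row, while
 * the pointers are set up, not once per pixel. */
static void filter_stripe_vertical(uint8_t *dst, const uint8_t *stripe,
                                   int w, int h, int stride,
                                   const uint8_t *above,
                                   const uint8_t *below,
                                   const int taps[TAPS])
{
    assert(w > 0 && h > 0);
    for (int y = 0; y < h; y++) {
        const uint8_t *rows[TAPS];
        for (int k = 0; k < TAPS; k++)
            rows[k] = pick_row(stripe, h, stride, above, below,
                               y + k - RADIUS);
        filter_col_7tap(dst + y * stride, rows, w, taps);
    }
}
```

Only `2 * RADIUS` extra rows are stored per stripe, and the per-pixel loop is identical for interior and boundary rows.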
Fine-Grained Threading and Pipelining
To utilize modern multi-core processors, libdav1d
employs a sophisticated task-based threading model. Rather than
assigning entire frames to individual threads, it breaks frames down
into smaller jobs.
Loop Restoration tasks are queued dynamically. As soon as a superblock row completes the preceding reconstruction and filtering stages, a Loop Restoration task is queued for that row. This fine-grained pipelining means CPU cores rarely sit idle waiting for an entire frame to decode, which lets the decoder scale well across threads, even on low-power mobile processors.
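The dependency rule behind this pipelining can be sketched with per-stage progress counters. This is a single-threaded simulation under stated assumptions, not libdav1d's scheduler: the stage names and the `frame_progress` structure are hypothetical, each stage is assumed to depend only on the same row of the previous stage, and a real decoder would update the counters atomically from several worker threads.

```c
#include <assert.h>
#include <stdbool.h>

enum stage { STAGE_RECON, STAGE_DEBLOCK, STAGE_CDEF, STAGE_LR, STAGE_COUNT };

/* done[s] is the number of rows that stage s has finished; rows of a
 * stage complete in order, so it is also the next row to run. */
typedef struct {
    int n_rows;
    int done[STAGE_COUNT];
} frame_progress;

/* Find a runnable task. A row may enter stage s once stage s - 1 has
 * finished that row. Later stages are preferred so that rows are pushed
 * all the way through before new rows are started. */
static bool next_task(const frame_progress *f, int *stage_out, int *row_out)
{
    for (int s = STAGE_COUNT - 1; s >= 0; s--) {
        const int row = f->done[s];
        if (row >= f->n_rows)
            continue; /* this stage is finished for the whole frame */
        if (s == STAGE_RECON || f->done[s - 1] > row) {
            *stage_out = s;
            *row_out = row;
            return true;
        }
    }
    return false;
}

/* Drain all tasks with one worker. Returns the number of tasks run and
 * records how many rows were still undecoded when the first Loop
 * Restoration task started. */
static int run_single_worker(frame_progress *f, int *undecoded_at_first_lr)
{
    assert(f->n_rows > 0);
    int n_tasks = 0, stage, row;
    *undecoded_at_first_lr = -1;
    while (next_task(f, &stage, &row)) {
        if (stage == STAGE_LR && row == 0)
            *undecoded_at_first_lr = f->n_rows - f->done[STAGE_RECON];
        f->done[stage]++; /* the actual filtering work would happen here */
        n_tasks++;
    }
    return n_tasks;
}
```

For a five-row frame, the first Loop Restoration task runs while four rows are still undecoded, which is the behavior the paragraph above describes: filtering overlaps with decoding instead of waiting for the full frame.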