How libdav1d Handles AV1 Super-Resolution

This article provides a technical overview of how the libdav1d AV1 decoder internally processes the AV1 Super-Resolution feature. It explains the decoding pipeline sequence, how horizontal-only scaling is integrated with loop filters, and the assembly-level optimizations used to achieve high-performance real-time playback.

The AV1 Super-Resolution Concept

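AV1 Super-Resolution allows a frame to be encoded at a reduced horizontal resolution (the coded width) and then scaled back to its full horizontal size (the upscaled width) during decoding. The scale is signalled per frame as a denominator d between 9 and 16 over a fixed numerator of 8, so the coded width is roughly 8/d of the upscaled width, between one half and eight ninths of it. Coding fewer samples lowers the bitrate. On the decoder side, entropy decoding, prediction, reconstruction, deblocking and CDEF also handle fewer pixels, yet the output is still a full-width image. Unlike a post-processing scaler, Super-Resolution is a normative part of the decoding loop: the upscaled, loop-restored frame is what later frames use as a reference.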
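The relationship between the two widths can be computed directly. The sketch below follows the rounding formula used in the AV1 specification's super-resolution parameters (a 3-bit coded denominator added to a minimum of 9, with a numerator of 8). The function name `superres_coded_width` is hypothetical, and the sketch does not attempt to cover edge cases for very narrow frames.

```python
SUPERRES_NUM = 8          # fixed numerator of the scale ratio
SUPERRES_DENOM_MIN = 9    # smallest denominator; the 3-bit field adds 0..7


def superres_coded_width(upscaled_width, coded_denom):
    """Hypothetical helper: coded (downscaled) width for a 3-bit coded_denom."""
    if not 0 <= coded_denom <= 7:
        raise ValueError("coded_denom is a 3-bit field (0..7)")
    denom = SUPERRES_DENOM_MIN + coded_denom  # 9..16
    # Round to nearest: width * 8 / denom.
    return (upscaled_width * SUPERRES_NUM + denom // 2) // denom
```

For a 1920-pixel-wide frame, the largest denominator (16) gives a coded width of 960, while the smallest (9) gives 1707.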

Inside the libdav1d Decoding Pipeline

In libdav1d, the Super-Resolution process is carefully positioned within the frame reconstruction pipeline. It does not occur at the very end of decoding; instead, it is executed sequentially between specific loop-filtering stages:

Coded-Resolution Stages: The frame is reconstructed, and both the Deblocking Filter (LF) and the Constrained Directional Enhancement Filter (CDEF) are applied to the frame at its smaller, coded resolution. This reduces computational overhead because these filters run on fewer pixels.
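Horizontal Scaling (Super-Resolution): Once CDEF is complete, libdav1d applies the Super-Resolution process. It upscales the frame horizontally using the normative 8-tap, 64-phase polyphase filter defined by the AV1 specification. Only the width changes: each row is resampled independently and the frame height is left untouched.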
Upscaled-Resolution Stages: After the frame is horizontally upscaled, Loop Restoration (LR) is applied at the full, upscaled resolution. Finally, if film grain is present in the bitstream, Film Grain Synthesis is applied to the upscaled image.
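The stage order and the width each stage runs at can be summarized in a small model. This is an illustrative sketch only: `pipeline_plan` is a hypothetical function, it considers a single luma plane, and it says nothing about how libdav1d actually schedules these stages.

```python
def pipeline_plan(coded_w, upscaled_w, height, film_grain=False):
    """Hypothetical model: (stage, width, pixel_count) tuples in decoding order."""
    plan = []
    # Coded-resolution stages run on the narrower frame.
    for stage in ("reconstruction", "deblocking", "cdef"):
        plan.append((stage, coded_w, coded_w * height))
    # Super-resolution reads coded_w columns and writes upscaled_w columns.
    plan.append(("superres", upscaled_w, upscaled_w * height))
    # Upscaled-resolution stages.
    plan.append(("loop_restoration", upscaled_w, upscaled_w * height))
    if film_grain:
        # Film grain is applied to the output picture only; AV1 reference
        # frames are stored without grain.
        plan.append(("film_grain", upscaled_w, upscaled_w * height))
    return plan
```

With a 960-pixel coded width and a 1920x1080 output, deblocking and CDEF each touch 1,036,800 luma pixels instead of 2,073,600, while Loop Restoration works on the full 2,073,600.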

Memory Management and Threading

libdav1d is designed for heavy multi-threading, using both frame-level and tile/row-level threading. Supporting Super-Resolution within this model requires additional buffer handling:

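Line Buffers: At the edges of each processing stripe, Loop Restoration reads a few rows from the deblocked frame as it was before CDEF. Because Loop Restoration runs at the upscaled width, those boundary rows must be upscaled too, so libdav1d keeps dedicated line buffers for them.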
Synchronization: The row-processing threads must synchronize. A thread cannot begin the Loop Restoration phase on an upscaled row until the Super-Resolution filter has finished processing the corresponding CDEF-filtered row.
On-the-Fly Scaling: To improve cache locality, libdav1d runs the post-filters one superblock row at a time rather than over the whole frame at once. The rows CDEF has just written are therefore usually still in cache when the upscaler reads them, and the freshly upscaled rows are consumed by Loop Restoration shortly afterwards.
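The synchronization rule above is a classic producer/consumer dependency. The sketch below models it with a Python `threading.Condition`: one thread upscales rows and publishes its progress, another waits until the rows it needs are ready before running Loop Restoration. `RowProgress`, `run_rows` and the `lag` parameter (standing in for Loop Restoration needing rows below the current one) are hypothetical names. libdav1d's real scheduler is written in C and works differently; this only illustrates the dependency.

```python
import threading


class RowProgress:
    """Hypothetical tracker of the highest row the upscaler has finished."""

    def __init__(self):
        self._done = -1
        self._cond = threading.Condition()

    def mark_done(self, row):
        with self._cond:
            self._done = max(self._done, row)
            self._cond.notify_all()  # wake any waiting Loop Restoration thread

    def wait_for(self, row):
        with self._cond:
            self._cond.wait_for(lambda: self._done >= row)


def run_rows(num_rows, lag=0):
    """Run a super-res producer and an LR consumer; return the event log."""
    progress = RowProgress()
    log = []
    log_lock = threading.Lock()

    def superres_worker():
        for r in range(num_rows):
            # (the CDEF output of row r would be upscaled here)
            with log_lock:
                log.append(("superres", r))
            progress.mark_done(r)

    def lr_worker():
        for r in range(num_rows):
            # LR on row r may need rows below it, hence the lag.
            progress.wait_for(min(r + lag, num_rows - 1))
            with log_lock:
                log.append(("lr", r))

    threads = [threading.Thread(target=lr_worker),
               threading.Thread(target=superres_worker)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

Starting the consumer first is deliberate: it blocks on the condition variable until the producer publishes the rows it depends on, so the ordering is enforced by the dependency rather than by start order.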

SIMD and Assembly Optimizations

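The 8-tap horizontal scaling filter is computationally intensive: every output pixel needs eight multiply-accumulate operations, and the filter phase generally changes from one output pixel to the next. To prevent Super-Resolution from becoming a bottleneck, the libdav1d codebase features hand-written assembly optimizations for various hardware architectures: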

x86 Platforms: Highly optimized implementations using SSSE3, AVX2, and AVX-512 vector instructions. These assembly routines vectorize the 8-tap filter coefficients, processing multiple horizontal pixels in parallel.
ARM Platforms: Dedicated NEON assembly implementations designed for mobile and embedded processors, ensuring efficient register usage and memory alignment.
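A scalar version of the per-row work shows what these SIMD routines speed up. The sketch assumes the fixed-point layout we understand the AV1 specification to use (source positions in 1/16384-pixel units, 64 filter phases, 8 taps, coefficients summing to 128), but two parts are deliberately simplified: the filter bank is a placeholder that packs bilinear weights into 8-tap form (it is not the normative coefficient table), and the starting offset only aligns pixel centres without the specification's rounding-error correction. All names are hypothetical, and this is not libdav1d's code.

```python
SCALE_BITS = 14                            # positions in 1/16384 source-pixel units
EXTRA_BITS = 8                             # low position bits ignored for phase choice
SCALE_MASK = (1 << SCALE_BITS) - 1
PHASES = 1 << (SCALE_BITS - EXTRA_BITS)    # 64 filter phases
TAPS = 8
TAP_OFFSET = 3                             # taps read src_x - 3 .. src_x + 4
FILTER_BITS = 7                            # each phase's coefficients sum to 128


def placeholder_filter_bank():
    """Bilinear weights in 8-tap form; NOT the normative AV1 coefficients."""
    bank = []
    for phase in range(PHASES):
        w = (phase * (1 << FILTER_BITS) + PHASES // 2) // PHASES
        taps = [0] * TAPS
        taps[TAP_OFFSET] = (1 << FILTER_BITS) - w
        taps[TAP_OFFSET + 1] = w
        bank.append(taps)
    return bank


def upscale_row(row, out_w, bank, bit_depth=8):
    """Resample one row of pixels to out_w columns with a polyphase filter."""
    in_w = len(row)
    max_val = (1 << bit_depth) - 1
    # Source step per output pixel, rounded to 1/16384 precision.
    step = ((in_w << SCALE_BITS) + out_w // 2) // out_w
    # Align pixel centres and add half a phase for rounding (simplified).
    start = (step >> 1) - (1 << (SCALE_BITS - 1)) + (1 << (EXTRA_BITS - 1))
    out = []
    for x in range(out_w):
        pos = start + x * step
        src_x = pos >> SCALE_BITS                  # integer source column (floor)
        phase = (pos & SCALE_MASK) >> EXTRA_BITS   # filter phase 0..63
        acc = 0
        for k, coeff in enumerate(bank[phase]):
            sx = min(max(src_x + k - TAP_OFFSET, 0), in_w - 1)  # clamp at edges
            acc += coeff * row[sx]
        val = (acc + (1 << (FILTER_BITS - 1))) >> FILTER_BITS
        out.append(min(max(val, 0), max_val))    # clip to the pixel range
    return out
```

Because neighbouring output pixels usually select different phases, a vectorized version typically processes several output pixels at once, each lane with its own coefficients, rather than vectorizing a single pixel's eight taps.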

By leveraging these hardware-specific optimizations and integrating the scaling directly into the row-threading pipeline, libdav1d minimizes the CPU and memory bandwidth overhead associated with AV1 Super-Resolution.