How libdav1d Detects CPU Features at Runtime

This article explains how the libdav1d AV1 decoder detects CPU features at runtime and uses the result to select optimized assembly routines. It covers the detection mechanisms used on different processor architectures, such as x86 and ARM, and the dispatch step that binds the decoder to the fastest code the host processor can run.

The Need for Runtime CPU Detection

To decode AV1 video efficiently, libdav1d relies heavily on handwritten assembly language tailored to specific CPU instruction set extensions (such as AVX2, AVX-512, or ARM NEON). Because software developers cannot predict which CPU an end user will have, choosing a single instruction set at compile time is not enough. Compiling solely for the lowest common denominator results in poor performance, while compiling exclusively for the newest instructions makes the program crash with an illegal-instruction fault on older processors.

To solve this, libdav1d compiles multiple versions of its performance-critical DSP (Digital Signal Processing) functions into the same binary. During initialization, it queries the host CPU to identify supported instruction sets and points its function pointers at the fastest compatible implementations.

How libdav1d Queries CPU Capabilities

The method libdav1d uses to query CPU features depends entirely on the underlying hardware architecture and the operating system.

1. x86 and x86-64 Architectures (Intel and AMD)

On x86 platforms, libdav1d detects vector extensions like SSE2, SSSE3, SSE4.1, AVX2, and AVX-512. It does this using the assembly-level cpuid instruction.

The cpuid Instruction: This is a native CPU instruction that returns processor-specific information in the CPU registers.
OS-Specific Checks: While cpuid indicates whether the processor physically supports an instruction set, libdav1d must also verify that the operating system saves and restores the corresponding register state during context switches (for example, YMM registers for AVX or ZMM registers for AVX-512). Where necessary, it does this by executing the xgetbv instruction, which reads extended control register 0 (XCR0); the bits of XCR0 show which register states the OS has enabled.
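
The two-step check described above (cpuid for hardware support, then xgetbv for OS support) can be sketched in C as follows. This is a minimal illustration, not libdav1d's own code: it assumes GCC or Clang (for the cpuid.h helpers and the inline assembly syntax), and the SKETCH_CPU_* constants and sketch_* function names are hypothetical. On non-x86 targets the function simply reports no features.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h> /* GCC/Clang helpers: __get_cpuid, __get_cpuid_count */
#endif

/* Hypothetical flag bits for this sketch; not libdav1d's real constants. */
enum {
    SKETCH_CPU_SSE2  = 1 << 0,
    SKETCH_CPU_SSSE3 = 1 << 1,
    SKETCH_CPU_SSE41 = 1 << 2,
    SKETCH_CPU_AVX2  = 1 << 3,
};

#if defined(__x86_64__) || defined(__i386__)
/* Read XCR0 with xgetbv. Only legal when cpuid reports OSXSAVE. */
static uint64_t sketch_read_xcr0(void) {
    uint32_t eax, edx;
    __asm__ volatile ("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
    return ((uint64_t)edx << 32) | eax;
}
#endif

unsigned sketch_detect_x86_flags(void) {
    unsigned flags = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;

    /* Leaf 1 carries the baseline feature bits. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    if (edx & (1u << 26)) flags |= SKETCH_CPU_SSE2;
    if (ecx & (1u << 9))  flags |= SKETCH_CPU_SSSE3;
    if (ecx & (1u << 19)) flags |= SKETCH_CPU_SSE41;

    /* AVX2 is usable only if the CPU has AVX (ECX bit 28), the OS has
     * enabled xgetbv (OSXSAVE, ECX bit 27), and XCR0 shows that the OS
     * saves both XMM (bit 1) and YMM (bit 2) state. */
    const int osxsave = (ecx >> 27) & 1;
    const int avx     = (ecx >> 28) & 1;
    if (osxsave && avx && (sketch_read_xcr0() & 0x6) == 0x6) {
        /* Leaf 7, subleaf 0: EBX bit 5 is AVX2. */
        if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) &&
            (ebx & (1u << 5)))
            flags |= SKETCH_CPU_AVX2;
    }
#endif
    return flags;
}
```

Calling sketch_detect_x86_flags() once at startup and caching the result is enough, since the feature set does not change while the process runs. An AVX-512 check would follow the same shape, with additional XCR0 bits for the opmask and ZMM state.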

2. ARM Architectures (AArch32 and AArch64)

ARM processors do not have a standardized, user-space instruction equivalent to cpuid that works uniformly across all operating systems. Therefore, libdav1d relies on operating system APIs to query capabilities like NEON or SVE (Scalable Vector Extension).

Linux and Android: The library uses the auxiliary vector API via getauxval(AT_HWCAP) or parses /proc/cpuinfo to read hardware capability flags exposed by the Linux kernel.
macOS and iOS: It queries the kernel using the sysctlbyname function, checking keys like hw.optional.neon.
Windows on ARM: It calls the Windows API function IsProcessorFeaturePresent.
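
The per-OS queries listed above might look like the following sketch for the NEON case. It is an illustration under stated assumptions rather than libdav1d's implementation: the SKETCH_CPU_NEON constant and the function name are hypothetical, the HWCAP bit position is the one the Linux kernel uses for NEON on 32-bit ARM, and AArch64 builds treat NEON as part of the baseline instead of querying for it. On non-ARM targets the function reports no features.

```c
#include <stddef.h>

#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#elif defined(__APPLE__)
#include <sys/types.h>
#include <sys/sysctl.h> /* sysctlbyname */
#elif defined(_WIN32)
#include <windows.h>    /* IsProcessorFeaturePresent */
#endif

/* Hypothetical flag bit for this sketch; not libdav1d's real constant. */
enum { SKETCH_CPU_NEON = 1 << 0 };

unsigned sketch_detect_arm_flags(void) {
    unsigned flags = 0;
#if defined(__aarch64__) || defined(_M_ARM64)
    /* Assumption: Advanced SIMD (NEON) is baseline on AArch64 targets. */
    flags |= SKETCH_CPU_NEON;
#elif defined(__arm__) && defined(__linux__)
    /* 32-bit ARM Linux: bit 12 of AT_HWCAP is the kernel's NEON flag. */
    if (getauxval(AT_HWCAP) & (1UL << 12))
        flags |= SKETCH_CPU_NEON;
#elif defined(__arm__) && defined(__APPLE__)
    /* Apple platforms: ask the kernel through sysctlbyname. */
    int has_neon = 0;
    size_t len = sizeof(has_neon);
    if (sysctlbyname("hw.optional.neon", &has_neon, &len, NULL, 0) == 0 &&
        has_neon)
        flags |= SKETCH_CPU_NEON;
#elif defined(_M_ARM) && defined(_WIN32)
    /* Windows on 32-bit ARM: query the processor feature table. */
    if (IsProcessorFeaturePresent(PF_ARM_NEON_INSTRUCTIONS_AVAILABLE))
        flags |= SKETCH_CPU_NEON;
#endif
    return flags;
}
```

Optional extensions such as SVE would be probed the same way, by testing further bits or keys through the same OS interface.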

3. Other Architectures

For other architectures like PPC64 (PowerPC) or RISC-V, libdav1d employs similar OS-specific interfaces (such as getauxval on Linux) to check for vector extensions (VSX on PowerPC or the ‘V’ extension on RISC-V).

The Dynamic Dispatch Mechanism

Once libdav1d determines the CPU’s capabilities during initialization, it configures its function pointers. This process is called dynamic dispatch.

The DSP Context: libdav1d maintains a structure containing function pointers for all core decoding operations (such as intra-prediction, loop filtering, and inverse transforms).
Initialization: During the initialization phase, the library checks the detected CPU flag mask.
Pointer Assignment: If the CPU supports AVX2, the library populates the DSP structure with pointers to the AVX2 assembly functions. If the CPU only supports SSE4.1, it loads the SSE4.1 pointers instead. If none of the supported extensions is detected, it falls back to portable C implementations.

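This pointer assignment happens once at initialization rather than per frame, so the decoding loop performs no feature checks. The only remaining cost is the indirect call through the function pointer, which is negligible next to the work each DSP function does.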
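
The pattern described above can be sketched in a few lines of C. Everything here is hypothetical and chosen for illustration: the SketchDSPContext structure, the flag constants, and the avg operation are not libdav1d's real names, and the "SSE4.1" and "AVX2" functions are plain C stand-ins for what would be handwritten assembly. The initializer starts from the C fallback and upgrades the pointer for each feature level the flag mask allows.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch. */
enum {
    SKETCH_CPU_SSE41 = 1 << 0,
    SKETCH_CPU_AVX2  = 1 << 1,
};

/* A tiny "DSP context": one function pointer per operation. */
typedef struct SketchDSPContext {
    void (*avg)(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n);
} SketchDSPContext;

/* Portable C fallback: rounded average of two pixel rows. */
static void avg_c(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
}

/* Stand-ins for assembly versions; a real build would link SIMD code. */
static void avg_sse41_standin(uint8_t *dst, const uint8_t *a,
                              const uint8_t *b, size_t n) {
    avg_c(dst, a, b, n);
}

static void avg_avx2_standin(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, size_t n) {
    avg_c(dst, a, b, n);
}

/* Run once at initialization; later calls go straight through c->avg. */
void sketch_dsp_init(SketchDSPContext *c, unsigned flags) {
    c->avg = avg_c;                       /* always-valid default */

    if (!(flags & SKETCH_CPU_SSE41)) return;
    c->avg = avg_sse41_standin;           /* upgrade for SSE4.1 */

    if (!(flags & SKETCH_CPU_AVX2)) return;
    c->avg = avg_avx2_standin;            /* upgrade again for AVX2 */
}
```

After sketch_dsp_init(&ctx, flags), decoding code calls ctx.avg(dst, a, b, n) without knowing which implementation is behind the pointer. Ordering the checks from oldest to newest extension means the last assignment that succeeds is the fastest one available.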

User Control and Debugging

libdav1d allows developers and users to override runtime CPU detection by applying a mask to the detected CPU flags, which restricts the decoder to a subset of the features the processor supports. The dav1d command-line tool exposes this as its --cpumask option. This is particularly useful for isolating bugs in a specific assembly implementation or for benchmarking different instruction sets on the same machine.
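
One simple way to implement such an override is to AND the detected flags with a user-supplied mask before any function pointers are assigned. The sketch below shows the idea with hypothetical names (sketch_set_cpu_flags_mask and sketch_get_cpu_flags); it is not libdav1d's actual interface.

```c
/* Mask applied to every flag query; all bits set means "no restriction". */
static unsigned sketch_flags_mask = ~0u;

/* Restrict the features the dispatcher is allowed to see. */
void sketch_set_cpu_flags_mask(unsigned mask) {
    sketch_flags_mask = mask;
}

/* Combine what the hardware reports with what the user permits. */
unsigned sketch_get_cpu_flags(unsigned detected) {
    return detected & sketch_flags_mask;
}
```

Because the mask can only clear bits, it can disable an instruction set the CPU has but can never enable one the CPU lacks. Setting the mask to 0 forces the portable C code paths, which gives a baseline for comparing assembly implementations.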