How libdav1d Minimizes Thread Synchronization Overhead
This article explores how libdav1d, the open-source AV1 video decoder,
achieves high multi-threaded decoding performance by minimizing thread
synchronization overhead. We examine its use of fine-grained row-based
threading, lock-free dependency tracking with atomic variables, dynamic
job scheduling, and low-overhead thread parking, which together reduce
thread starvation and CPU idling.
Fine-Grained Row-Based Threading
Traditional video decoders often rely on frame-level threading, where each thread decodes an entire frame, or tile-level threading, which depends on how the encoder structured the video. Frame-level threading introduces significant latency, while tile-level threading can suffer from load imbalances if tiles are of unequal sizes.
libdav1d addresses this with fine-grained row-based threading
(specifically, superblock row threading). Instead of waiting for an
entire frame to finish, a thread can begin decoding a row of superblocks
as soon as the blocks it depends on in the row above, and the
corresponding reference frame data, have been decoded. This exposes far
more parallel work per frame than frame-level or tile-level threading
alone, which helps keep the cores of a multi-core processor busy.
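The readiness rule described above can be sketched as a small predicate. This is an illustrative sketch, not libdav1d's actual code: the names `sb_ready`, `cols_done`, and `SB_LAG` are hypothetical, the one-superblock lead required of the row above is an assumption, and the reference-frame check is left out for brevity.

```c
#include <stdbool.h>

/* Assumption: the row above must stay this many superblocks ahead. */
enum { SB_LAG = 1 };

/*
 * cols_done[r] holds the number of superblocks already decoded in row r.
 * Returns true if superblock (row, col) may be decoded now.
 */
bool sb_ready(const int *cols_done, int n_cols, int row, int col)
{
    /* Each row is decoded strictly left to right. */
    if (col != cols_done[row])
        return false;

    /* The first row has no row above it to wait for. */
    if (row == 0)
        return true;

    /* Required progress in the row above, clamped to the row width. */
    int need = col + 1 + SB_LAG;
    if (need > n_cols)
        need = n_cols;

    return cols_done[row - 1] >= need;
}
```

With a rule of this shape, rows advance as a diagonal wavefront: each row trails the one above it by a small, fixed number of superblocks instead of waiting for it to finish.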
Lock-Free Dependency Tracking with C11 Atomics
In multi-threaded decoding, threads must constantly check whether their dependencies (such as reference frames or neighboring blocks) are ready. Relying on traditional operating system locks, such as mutexes and condition variables, for these frequent checks adds overhead from lock contention and from the context switches that occur whenever a thread blocks.
To avoid this, libdav1d uses C11 atomic operations for dependency
tracking. Each frame and superblock row maintains an atomic progress
counter. When a thread finishes decoding a segment of a frame, it
publishes the new progress value with an atomic store. Threads that
depend on this data check progress with atomic loads, which are cheap
because they do not enter the kernel or take a lock. By keeping mutexes
out of routine dependency checks, libdav1d lets threads coordinate
through ordinary memory operations rather than blocking system calls.
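A minimal sketch of such a progress counter, using only standard C11 `<stdatomic.h>` calls, might look as follows. The type and function names (`FrameProgress`, `progress_signal`, `progress_reached`) are hypothetical and are not taken from libdav1d. The release/acquire memory ordering is an assumption chosen so that a reader who observes the new counter value also observes the decoded data written before it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-frame progress: count of fully decoded superblock rows. */
typedef struct {
    atomic_int rows_done;
} FrameProgress;

void progress_init(FrameProgress *p)
{
    atomic_init(&p->rows_done, 0);
}

/* Producer side: publish that rows [0, rows) are now complete.
 * The release store orders all earlier pixel writes before the counter. */
void progress_signal(FrameProgress *p, int rows)
{
    atomic_store_explicit(&p->rows_done, rows, memory_order_release);
}

/* Consumer side: non-blocking check whether `row` has been decoded.
 * The acquire load pairs with the release store above. */
bool progress_reached(FrameProgress *p, int row)
{
    return atomic_load_explicit(&p->rows_done, memory_order_acquire) > row;
}
```

The check costs one atomic load and never blocks, so a thread can poll several dependencies before deciding whether it has to wait at all.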
Dynamic Task Scheduling
Rather than statically assigning specific frames or rows to designated
threads, libdav1d uses a centralized, dynamic task queue. A pool of
worker threads pulls available jobs from this queue.
A job enters the queue only when all of its execution dependencies are met. When a thread finishes its current task, it immediately asks the queue for the next ready superblock row. This dynamic allocation prevents “thread starvation,” where some threads sit idle while a slow thread finishes its statically assigned share. It also avoids the bookkeeping and synchronization needed to maintain a fixed thread-to-task mapping.
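The idea that a job becomes visible to workers only once its dependencies are met can be sketched with a per-task dependency counter in front of a shared queue. Everything here is an assumption made for illustration: the `Task` and `TaskQueue` types, the function names, and the choice of a short mutex-protected linked list are not libdav1d's real data structures.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Task {
    int          row;        /* superblock row this task decodes */
    atomic_int   deps_left;  /* number of unmet dependencies */
    struct Task *next;
} Task;

typedef struct {
    pthread_mutex_t lock;    /* held only for the few pointer updates */
    Task *head, *tail;
} TaskQueue;

void queue_init(TaskQueue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = q->tail = NULL;
}

void queue_push(TaskQueue *q, Task *t)
{
    t->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_mutex_unlock(&q->lock);
}

/* Workers call this when idle; returns NULL if no task is ready. */
Task *queue_pop(TaskQueue *q)
{
    pthread_mutex_lock(&q->lock);
    Task *t = q->head;
    if (t) {
        q->head = t->next;
        if (!q->head) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return t;
}

/* Called when one dependency of `t` completes. The task is enqueued
 * only by the caller that resolves its last dependency. */
bool task_dep_done(TaskQueue *q, Task *t)
{
    if (atomic_fetch_sub_explicit(&t->deps_left, 1,
                                  memory_order_acq_rel) == 1) {
        queue_push(q, t);
        return true;
    }
    return false;
}
```

Because unready tasks never enter the queue, a worker that pops a task can run it to completion without re-checking dependencies, and any idle worker can take any ready task.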
Optimized Thread Parking
While lock-free atomics handle active decoding synchronization, threads must still be put to sleep (parked) when there is absolutely no work available (e.g., during low-resolution playback or when waiting for I/O).
libdav1d minimizes the overhead of parking and waking threads by using
a custom condition variable wrapper. It invokes heavyweight operating
system synchronization primitives only when a thread must actually go to
sleep. During short dependency waits, a thread may first spin briefly on
the atomic progress value before sleeping, so that it can resume work
with very little delay once the dependency is resolved.
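A spin-then-park wait of the kind described above can be sketched with a C11 atomic and a POSIX condition variable. This is a generic illustration under stated assumptions, not libdav1d's wrapper: the `Waitable` type, its functions, and the `SPIN_LIMIT` value are hypothetical, and for simplicity the signaling side always takes the mutex, which a tuned implementation would likely avoid when no thread is asleep.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: how many polls to try before falling back to sleeping. */
enum { SPIN_LIMIT = 1000 };

typedef struct {
    atomic_int      progress;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} Waitable;

void waitable_init(Waitable *w)
{
    atomic_init(&w->progress, 0);
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->cond, NULL);
}

/* Publish new progress and wake any parked waiters. Storing under the
 * lock guarantees a waiter cannot miss the wakeup between its check
 * and its call to pthread_cond_wait. */
void waitable_signal(Waitable *w, int value)
{
    pthread_mutex_lock(&w->lock);
    atomic_store_explicit(&w->progress, value, memory_order_release);
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->lock);
}

/* Wait until progress >= target. Returns true if the wait was satisfied
 * during the spin phase, false if the thread had to park. */
bool waitable_wait(Waitable *w, int target)
{
    /* Fast path: poll the atomic without touching the mutex. */
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load_explicit(&w->progress, memory_order_acquire) >= target)
            return true;

    /* Slow path: park on the condition variable. */
    pthread_mutex_lock(&w->lock);
    while (atomic_load_explicit(&w->progress, memory_order_acquire) < target)
        pthread_cond_wait(&w->cond, &w->lock);
    pthread_mutex_unlock(&w->lock);
    return false;
}
```

The spin limit is the tuning knob: too low and threads pay for a sleep and wakeup on waits that would have resolved within microseconds, too high and waiting threads burn CPU time that working threads could use.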