Lossless JPEG decoder performance on CPU, benchmarks

Fast DNG decoding on CPU

The lossless JPEG encoding algorithm is widely used in photo and video cameras that shoot RAW. Compression is essential to increase the number of frames that can be stored in the camera's internal memory or on a flash card. Lossless encoding guarantees exact, bit-for-bit reconstruction of the original RAW data, although the compression ratio is quite moderate, usually around 1.5–2. Lossless JPEG (not to be confused with JPEG-LS) uses a predictor function and compresses the image by encoding the prediction error without any loss of information. The standard permits any data precision between 2 and 16 bits per sample. In the majority of cameras, lossless encoding of RAW data is done in real time inside the camera.

Lossless JPEG compression algorithm

Here we consider the lossless JPEG encoding algorithm only for grayscale (Bayer pattern) images with arbitrary width and height.
Since we need to encode raw data, we have to bear in mind that the original image is a raw Bayer CFA with an RGGB or similar pattern. That is why we can virtually double the image width and halve the image height. This is a good way to get better data correlation at encoding.
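One way to picture this virtual resize, assuming the RAW samples sit in a row-major buffer, is to treat every pair of consecutive sensor rows as a single row of double width. The helper names below are illustrative only and are not taken from any particular library.

```python
def fold_rows(pixels, width, height):
    """View a row-major height x width buffer as (height // 2) rows of 2 * width samples."""
    if height % 2 != 0:
        raise ValueError("height must be even to fold row pairs")
    folded_width = 2 * width
    return [pixels[r * folded_width:(r + 1) * folded_width]
            for r in range(height // 2)]


def unfold_rows(folded, width):
    """Inverse of fold_rows: split every folded row back into two sensor rows."""
    rows = []
    for row in folded:
        rows.append(row[:width])
        rows.append(row[width:])
    return rows


# 4 x 4 RGGB mosaic; the values are just sample indices.
mosaic = list(range(16))
folded = fold_rows(mosaic, width=4, height=4)
# folded[1][0] is sensor row 2, column 0. The sample above it, folded[0][0],
# is sensor row 0, column 0, i.e. the same CFA colour (R in RGGB).
restored = unfold_rows(folded, width=4)
```

Under this assumption, the upper neighbour seen by the predictor comes from the same colour channel two sensor rows up, which is a plausible reason for the better correlation.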
Before lossless JPEG encoding, we have to choose a prediction formula (one of 7 choices) to encode the difference between the original and the predicted value of each pixel. The most frequent choice is the two-dimensional predictor No. 6: Px = Rb + (Ra - Rc)/2. It means that the prediction is the value of the upper pixel plus half of the difference between the left and upper-left pixels.
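As a small illustration, the seven predictors can be written as one function. Here Ra is the left neighbour, Rb the upper one and Rc the upper-left one. The sketch assumes the halving is done with an arithmetic right shift, as common implementations do; the function name is ours.

```python
def predict(selector, ra, rb, rc):
    """Lossless JPEG predictors 1..7. ra = left, rb = above, rc = above-left."""
    if selector == 1:
        return ra
    if selector == 2:
        return rb
    if selector == 3:
        return rc
    if selector == 4:
        return ra + rb - rc
    if selector == 5:
        return ra + ((rb - rc) >> 1)
    if selector == 6:
        return rb + ((ra - rc) >> 1)
    if selector == 7:
        return (ra + rb) >> 1
    raise ValueError("selector must be in 1..7")


ra, rb, rc, x = 100, 110, 104, 109
px = predict(6, ra, rb, rc)  # 110 + ((100 - 104) >> 1) = 110 - 2 = 108
diff = x - px                # 1: this small residual is what gets entropy coded
```

The closer the prediction is to the real value, the smaller the residuals and the shorter the resulting codes.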
Numerically lossless JPEG compression is done with Huffman coding using a fixed table. If an image has just one component, one Huffman table is enough. Usually there are two components, in which case we need one or two Huffman tables.
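In the JPEG standard, the Huffman table does not code the prediction error directly. It codes the error's size category (SSSS, the number of bits needed for its magnitude), and the category is followed by that many raw bits. The sketch below shows this mapping in both directions. The function names are ours, and the special category 16 (which carries no extra bits) is left out for brevity.

```python
def to_category(diff):
    """Return (ssss, extra_bits) for a prediction error with |diff| <= 32767."""
    if diff == 0:
        return 0, 0
    ssss = abs(diff).bit_length()
    # Negative values are stored as the low ssss bits of (diff - 1).
    raw = diff if diff > 0 else diff - 1
    return ssss, raw & ((1 << ssss) - 1)


def from_category(ssss, extra_bits):
    """Inverse mapping used by the decoder after the Huffman symbol is known."""
    if ssss == 0:
        return 0
    if extra_bits < (1 << (ssss - 1)):
        # A leading 0 bit marks a negative value.
        return extra_bits - (1 << ssss) + 1
    return extra_bits


category, bits = to_category(-5)  # category 3, extra bits 0b010
```

Only the category goes through the Huffman table, so the table has at most 17 symbols regardless of the sample precision.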
First, we split the image into tiles to encode them independently. After compression, we place all encoded tiles into the RAW file and add the offset of each tile to the header.
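A minimal sketch of that bookkeeping, assuming the encoded tiles are already available as byte strings, could look as follows. In a DNG file, which is TIFF-based, such offsets and sizes are stored in the TileOffsets and TileByteCounts tags. The function name and the base_offset parameter are hypothetical.

```python
def pack_tiles(encoded_tiles, base_offset=0):
    """Concatenate encoded tiles and record where each one starts and how long it is."""
    offsets, sizes = [], []
    payload = bytearray()
    for tile in encoded_tiles:
        offsets.append(base_offset + len(payload))
        sizes.append(len(tile))
        payload += tile
    return bytes(payload), offsets, sizes


# Two dummy "encoded" tiles placed 100 bytes into the file.
payload, offsets, sizes = pack_tiles([b"ab", b"cde"], base_offset=100)
```

With the offsets in the header, a decoder can jump straight to any tile, which is what makes independent (and parallel) tile decoding possible.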

Lossless JPEG decoding

The performance of a lossless JPEG decoder is mostly limited by Huffman decoding. We need to read the bitstream bit by bit to recover Huffman codes and data bits. Huffman decompression is an essentially serial algorithm, so it is not well suited to the GPU, which is why everything is done on the CPU.
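To make the serial nature concrete, here is a deliberately naive bit-by-bit decoder. It builds the codes from the code-length counts and symbol list of a JPEG Huffman table and skips the zero byte that JPEG stuffs after each 0xFF in entropy-coded data. All names are ours, and an optimized decoder would use lookup tables instead of a dictionary.

```python
def build_codes(bits, huffval):
    """bits[i] = number of codes of length i + 1; huffval = symbols in code order."""
    table, code, k = {}, 0, 0
    for length in range(1, 17):
        for _ in range(bits[length - 1]):
            table[(length, code)] = huffval[k]
            code += 1
            k += 1
        code <<= 1
    return table


class BitReader:
    def __init__(self, data):
        self.data, self.pos, self.acc, self.left = data, 0, 0, 0

    def read_bit(self):
        if self.left == 0:
            byte = self.data[self.pos]
            self.pos += 1
            # Skip the stuffed 0x00 that follows a 0xFF data byte.
            if (byte == 0xFF and self.pos < len(self.data)
                    and self.data[self.pos] == 0x00):
                self.pos += 1
            self.acc, self.left = byte, 8
        self.left -= 1
        return (self.acc >> self.left) & 1


def decode_symbol(reader, table):
    """Extend the code one bit at a time until it matches a table entry."""
    code = 0
    for length in range(1, 17):
        code = (code << 1) | reader.read_bit()
        if (length, code) in table:
            return table[(length, code)]
    raise ValueError("invalid Huffman code")


# Toy table: codes 00 -> 0, 01 -> 1, 100 -> 2.
table = build_codes([0, 2, 1] + [0] * 13, [0, 1, 2])
reader = BitReader(bytes([0b10001001]))  # 100 | 01 | 00 | one padding bit
symbols = [decode_symbol(reader, table) for _ in range(3)]
```

The position of every code depends on the length of all codes before it, so the next symbol cannot be located until the current one has been decoded.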
Right after decoding, we restore the original pixel values according to the prediction formula and restore the original image width and height.
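The restoration step can be sketched for a single component and predictor No. 6 as follows. Per the JPEG standard, the very first sample is predicted as half of the full range, the rest of the first row uses the left neighbour, and the first sample of each later row uses the upper neighbour. The function name is ours.

```python
def reconstruct_p6(diffs, width, height, precision):
    """Rebuild samples from prediction errors using predictor No. 6."""
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if y == 0 and x == 0:
                pred = 1 << (precision - 1)   # initial prediction
            elif y == 0:
                pred = out[0][x - 1]          # first row: left neighbour
            elif x == 0:
                pred = out[y - 1][0]          # first column: upper neighbour
            else:
                ra, rb, rc = out[y][x - 1], out[y - 1][x], out[y - 1][x - 1]
                pred = rb + ((ra - rc) >> 1)
            # Arithmetic is modulo 2^16, as in the standard.
            out[y][x] = (pred + diffs[y * width + x]) & 0xFFFF
    return out


# 2 x 2 example with 12-bit samples.
image = reconstruct_p6([0, 2, -8, -1], width=2, height=2, precision=12)
```

Each sample depends on neighbours that were reconstructed just before it, so this loop is serial within a tile as well.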
After decoding all tiles, we compose the original uncompressed RAW image.

Many existing libraries for lossless JPEG decoding (dcraw, LibRaw, libjpeg, the Adobe DNG SDK decoder, etc.) are not optimized for speed, which leads to slow RAW decoding. This is not a problem when processing a single frame, but it may not be fast enough for workflows with high-resolution images (50–100 MPix and more), for batch processing, or for a RAW video player with smooth real-time output at 24–30 fps.

PC for testing

CPU Intel Core i7-6700 (Skylake, 4 cores, 3.4–4.0 GHz)
GPU NVIDIA GeForce GTX 1080 (Pascal, 20 SMs, 2560 cores, 1.6–1.7 GHz)
OS Windows 10 (x64)

Lossless JPEG decoders to compare

We will test the following lossless JPEG decoders:

Lossless JPEG decoder from Adobe DNG SDK
LJ92 decoder (liblj92 library)
LJ Decoder from Fastvideo (lossless jpeg library on CPU)

In a real use case, multithreaded decoder software is used, and each tile (or each frame) is decoded in a separate thread. We compared single-threaded applications on the same DNG images and on the same hardware. DNG decoding is done entirely on the CPU; no GPU image processing is applied. Test images have resolutions from 2 MPix to 16 MPix, 12-bit or 16-bit samples, one or two components (with just one Huffman table), one tile, and an RGGB pattern. Performance values correspond to decoding computations only.
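The thread-per-tile (or thread-per-frame) scheme can be sketched with a standard thread pool. In this sketch zlib only stands in for a lossless JPEG tile decoder so that the example runs; decode_tile and decode_tiles are hypothetical names. In Python, real speedup would also require a native decoder that releases the interpreter lock.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def decode_tile(compressed):
    # Stand-in for a lossless JPEG tile decoder (assumption: zlib is used
    # here only to have something real to decompress).
    return zlib.decompress(compressed)


def decode_tiles(compressed_tiles, threads=8):
    """Decode independent tiles in parallel, keeping the original tile order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in input order, so tiles can be placed back directly.
        return list(pool.map(decode_tile, compressed_tiles))


# Eight dummy tiles, each filled with a different byte value.
raw_tiles = [bytes([i]) * 1000 for i in range(8)]
decoded = decode_tiles([zlib.compress(t) for t in raw_tiles], threads=8)
```

Because tiles are encoded independently, no synchronization is needed between workers until the final image is composed.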

DNG decoding benchmarks for single thread applications

Adobe DNG SDK (12–16 bits): 30–32 MPix/s
LJ92 (library liblj92, 12–16 bits): 45–50 MPix/s
Fastvideo LJ Decoder (16-bit data): 60–70 MPix/s
Fastvideo LJ Decoder (12-bit data): 75–90 MPix/s

The fastest lossless JPEG decoding is achieved with Fastvideo LJ Decoder, thanks to highly optimized Huffman decompression routines. We use that library to speed up DNG decoding in the Fast CinemaDNG Processor software. This is very important for smooth real-time video preview of CinemaDNG footage.

For 12/14/16-bit DNG decoding, the Adobe and LJ92 decoders show almost the same performance on 12-bit and 16-bit data. Fastvideo LJ Decoder achieves its best result on 12-bit data, though on 16-bit compressed data it is still faster than the LJ92 and Adobe decoders.

DNG decoding performance for 8-thread applications

Below are some benchmarks that correspond to the best and the worst cases of lossless JPEG decoding in multithreaded applications. These examples illustrate how lossless JPEG decoding performance scales with multithreading on a multicore CPU. Performance does not grow linearly with the number of threads, and we need to take that into account.

Test #1: 16-bit image compressed to 10.4 bpp (lossless compression, ratio about 1.5)

LJ92 (library liblj92): 266 MPix/s
Fastvideo LJ Decoder: 407 MPix/s

Test #2: 12-bit image compressed to 5.6 bpp (lossless compression, ratio about 2.1)

LJ92 (library liblj92): 284 MPix/s
Fastvideo LJ Decoder: 475 MPix/s

These benchmarks show that fast CinemaDNG decoding on the CPU is possible for real-time applications at resolutions of 4K and more, up to 6K. Decoding optimization, vectorization, and multithreading are the key factors in achieving the best decoding performance.
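The link between throughput and frame rate is simple arithmetic. The sketch below converts the 8-thread results into frames per second, assuming a 4K frame of 4096 x 2160 pixels (about 8.85 MPix) and 1 MPix = 10^6 pixels. Both are our assumptions, not figures from the tests.

```python
def frames_per_second(mpix_per_s, width, height):
    """Convert decoder throughput in MPix/s to frames per second for a given frame size."""
    return mpix_per_s * 1e6 / (width * height)


WIDTH_4K, HEIGHT_4K = 4096, 2160  # assumed 4K frame size

results = {
    "LJ92, test #1": 266,
    "Fastvideo, test #1": 407,
    "LJ92, test #2": 284,
    "Fastvideo, test #2": 475,
}
for label, rate in results.items():
    fps = frames_per_second(rate, WIDTH_4K, HEIGHT_4K)
    print(f"{label}: {fps:.1f} fps at 4K")
```

Under these assumptions, even the slowest 8-thread result (266 MPix/s) corresponds to roughly 30 fps at 4K, and larger frames scale the frame rate down in proportion to their pixel count.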

Soon we will also implement fast decoding for CR2, NEF, ARW and other RAW formats that use lossless JPEG compression.