This project is read-only.

Random access in DEFLATE stream

Introduction

This article demonstrates how to extract the portion of the information from a stream that was compressed by the DEFLATE encoding. The DEFLATE encoding is applied in the popular formats such as ZLIB, GZIP, PKZIP, and many more. One of the weaknesses of compression streams is that they do not allow to access/read data from specific location – the data has to be read from the beginning to the point where desired information starts. Good example is TAR format with applied GZIP compression (.tar.gz): it is impossible to read only the content of the archive or to extract specific file without reading the file from the start without skipping a bit of information.

The article defines lightweight index that allows reading a DEFLATE stream without adding SYNC points or sacrificing compression ratio by clearing compressor’s buffer of the stream during compression. The index is created after the actual compression was performed by regular DEFLATE compressor. That allows applications, such as graphical archive tools, compressed databases or backup utilities, to read data from any position in the compressed files and to avoid performance penalty.

Background

The DEFLATE is a lossless compressed data format that compresses data with acceptable efficiency and speed. One problem: the format does not attempt to provide random access to the compressed data. It is connected with the way how the compressor works. The DEFLATE compressor reduces amount of bits for frequently used character using Huffman trees and uses LZ77 algorithm to reference duplicated substring that contained in the previous data.

The compressed stream consists of three block types that are encoded as:
  • Non-compressed stream of bytes (as is);
  • Using standard Huffman trees;
  • Using dynamically created Huffman trees.

Last block type has got the overhead of storing Huffman trees in the stream. The compressor can use any block type at any position of the stream.

The compressor can reference previous data in range up to 32768. Reference of the previous data past the start of the current block is allowed. Theoretically, the data that was used at the start of the uncompressed data can be carried across blocks to the end during compression by using previous data reference. That make impossible to just start reading from specific byte or bit of compressed stream – knowledge about previous data may be required.

The compressed data can be hinted in equal intervals. At the hint position reader-decoder will save the state (bit position, necessary previous data, etc). That will allow decoding of the data of the stream from that location. The hinting is an overhead that may vary from 10 to 32K per every hint point.

Index File Format

The index file/stream has three parts: header, hint states and states references. Let’s define few number encodings: unsigned 8-bit, 32-bit and 64-bit numbers (U8, U32 and U64), unsigned integer of variable length (U), and deflated stream of bytes (D). The U8, U32 and U64 numbers are stored as 1, 4 or 8 bytes in little-endian order. The U numbers are stored i portions of 7-bits in little-endian order, the high bit is set when more bits are follow, e.g. 0x23456 will be saved as 0xD6 0xE8 0x80. The deflated stream of bytes is compressed using DEFLATE algorithm.
The header part’s length is 32 bytes and has to be located at the start of the index file/stream:

Offset Type Description
0 U32 File signature “IxD1” (0x31447849)
4 U32 Hint interval
8 U64 Offset of states references (from this header)
16 U64 Uncompressed length of the data
24 U64 Offset of DEFLATE stream with in source file


The state references part is an array of U64 numbers, one for every hint point. The state reference is an offset from the header where state data is stored. The state data variable length and has following structure:

Type Optional Description
U8 No Flags
U No Bit offset in compressed deflate stream where decoder has to start decoding.
U Yes Bit offset in compressed deflate stream of the dynamic tree. Present when block type is defined as 2.
U Yes Number of bytes is left to be decoded in block type 0. See Bit 2 of flags.
U Yes Defined how many bytes from previous data belong to current hint frame. Present when previous data present. See Bits 3 and 4 of flags
U + n * U + D Yes Previous data.


The Flags field contains follwing information:
  • Bit 0-1 – Start DEFLATE block type for decoding;
  • Bit 2 – Set if state structure has block type 0 data left and numbers of bytes left to be decoded less than hint interval;
  • Bit 3 – Set if state structure has encoded previous data;
  • Bit 4 – Set if part of previous data has to be included in decoded data;
  • Bit 5-6 – Reserved;
  • Bit 7 – Set if DEFLATE block is final.

Previous data is represented as series of significant and discarded bytes. Since decoder may refer some within 32K range total length of data will not exceed this number and non all bytes from that 32K range will be used (discarded bytes). The index builder combines all significant bytes in one continued array. There are array of interleaved fragments lengths for significant and discarded bytes. For example, {‘A’, ‘B’, ?, ?, ‘C’, ?, ‘B’,’C’} data will be presented as array of lengths { 2, 2, 1, 1, 2 } and array of significant bytes {‘A’, ‘B’, ‘C’, ‘B’, ‘C’ }. The array of lengths is encoded as numbers: first one is amount of items, rest numbers are items them self. The array of significant bytes will be further compressed by DEFLATE algorithm and stored in the state.

Points of Interest

By default, the hint interval is chosen as 65536 (0x10000), but can be changed to any reasonable amount. By increasing the hint interval the index file size will be decreased, however there will be an overhead related to reading and decoding of unused information.

The index file can be used to restore part of the compressed stream, or utilize several processors to decompress the data.

References

  1. Deutsch, L.P., "DEFLATE Compressed Data Format Specification", RFC 1951, May 1996
  2. "Tar (file format)", Wikipedia

Last edited Sep 24, 2009 at 3:16 AM by notmasteryet, version 2

Comments

No comments yet.