Proposed zchunk file format

Jonathan Dieter <jdieter@xxxxxxxxx> · Fri, 16 Feb 2018 20:52:23 +0200

So here's my proposed file format for the zchunk file.  Should I add
some flags to facilitate possible different compression formats?

+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+
|  ID   |  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+-+-+-+-+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

ID
 '\0ZCK', identifies file as zchunk file

Index size
 This is a 64-bit unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index 
 is compressed using standard zstd compression without a custom 
 dictionary.

Compressed Dict
 This is a custom dictionary used when compressing each chunk.  
 Because each chunk is compressed completely separately from the 
 others, the custom dictionary gives us much better overall 
 compression.  The custom dictionary is compressed using standard zstd 
 compression without using a separate custom dictionary (for obvious 
 reasons).

Chunk
 This is a chunk of data, compressed using zstd with the custom 
 dictionary provided above.
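
To make the layout above concrete, here is a rough sketch of how a
reader might locate the compressed index from the first 12 bytes of the
file.  The byte order of the index size isn't specified above, so
little-endian is purely an assumption here, and read_header is a
made-up name for illustration.

```python
import struct

MAGIC = b"\0ZCK"   # 4-byte ID from the layout above
HEADER_LEN = 12    # 4-byte ID + 8-byte index size


def read_header(data: bytes) -> tuple[int, int]:
    """Return (start, end) byte offsets of the compressed index in data."""
    if len(data) < HEADER_LEN:
        raise ValueError("file too short to hold a zchunk header")
    if data[:4] != MAGIC:
        raise ValueError("not a zchunk file")
    # ASSUMPTION: little-endian; the proposal does not state a byte order.
    (index_size,) = struct.unpack_from("<Q", data, 4)
    return HEADER_LEN, HEADER_LEN + index_size
```

The compressed dict and the chunks then follow immediately after the
returned end offset.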

The index:

+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
|          sha256sum          |  End of dict  |
+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+

+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+
|          sha256sum          | End of chunk  |  ==> More
+++++++++++++++++++++++++++++++-+-+-+-+-+-+-+-+

sha256sum of compressed dict
 This is a binary sha256sum of the compressed dict, used to detect 
 whether two dicts are identical.

End of dict
 This is the location of the end of the dict with 0 being the end of 
 the index.  This gives us the information we need to find and 
 decompress the dict.

sha256sum of compressed chunk
 This is a binary sha256sum of the compressed chunk, used to detect 
 whether any two chunks are identical.

End of chunk
 This is the location of the end of the chunk with 0 being the end of 
 the index.  This gives us the information we need to find and 
 decompress each chunk.
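
As an illustration of the index entries described above, here is a
small sketch that walks an already-decompressed index and turns the
"end of" offsets into (start, end) ranges.  It assumes each entry is a
32-byte binary sha256sum followed by a 64-bit unsigned offset, and that
the offset is little-endian; neither the offset width nor the byte
order is spelled out above, so treat both as guesses.  Entry and
parse_index are hypothetical names.

```python
import struct
from typing import NamedTuple

# ASSUMPTION: 32-byte sha256 digest + 64-bit little-endian end offset.
ENTRY = struct.Struct("<32sQ")


class Entry(NamedTuple):
    digest: bytes  # binary sha256sum of the compressed dict or chunk
    start: int     # offset relative to the end of the index
    end: int       # offset relative to the end of the index


def parse_index(index: bytes) -> list[Entry]:
    """Parse a decompressed index; the first entry is the dict."""
    if len(index) % ENTRY.size:
        raise ValueError("index length is not a whole number of entries")
    entries = []
    start = 0  # 0 is the end of the index, where the dict begins
    for digest, end in ENTRY.iter_unpack(index):
        if end < start:
            raise ValueError("end offsets must not decrease")
        entries.append(Entry(digest, start, end))
        start = end  # the next item begins where this one ends
    return entries
```

Since offsets count from the end of the index, the absolute position of
an item in the file would be 12 + index size + start.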

The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed.  It must then be
re-embedded when assembling the file, so the user only needs to keep
one file.
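
To sketch how a client might use a separately downloaded index: given
the remote index entries and the set of sha256sums it already has
locally, it could work out which byte ranges still need fetching.  This
is only one possible approach, not something the format mandates;
ranges_to_fetch and its arguments are hypothetical, and the (digest,
end) pairs are assumed to be in file order with offsets relative to the
end of the index.

```python
def ranges_to_fetch(
    remote: list[tuple[bytes, int]], have: set[bytes]
) -> list[tuple[int, int]]:
    """Return merged (start, end) ranges for items missing locally.

    remote: (sha256 digest, end offset) pairs in file order.
    have:   digests of the dict/chunks already present locally.
    """
    ranges: list[tuple[int, int]] = []
    start = 0  # each item starts where the previous one ended
    for digest, end in remote:
        if digest not in have:
            if ranges and ranges[-1][1] == start:
                # Adjacent to the previous missing item: extend that range
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
        start = end
    return ranges
```

Merging adjacent ranges keeps the number of separate requests down when
several consecutive chunks have changed.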