Re: Proposed zchunk file format - V3

Jonathan Dieter <jdieter@xxxxxxxxx> · Wed, 28 Feb 2018 19:52:54 +0200

I've been working on a C implementation of this spec and came up with
a few more changes.  I think it's important to have a checksum of the
index as well as of the data, since we want to be able to verify that
the index is as expected before trying to parse it.

I've also added the ability to use a different hash type for the chunk
checksums than for the index checksum and overall checksum.  The idea
is that a weaker checksum may be chosen for the chunks to reduce the
size of the index without weakening overall verification.
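
As a rough illustration of the size argument, here is a small Python
sketch (not part of the C implementation).  It assumes the checksums
are stored as raw, full-length digests and that each index entry is
one digest plus one 64-bit offset; the chunk count is a made-up
number.

```python
import hashlib

# Digest sizes in bytes for the two checksum types listed in this spec.
DIGEST_SIZE = {
    "SHA-1": hashlib.sha1().digest_size,      # 20 bytes
    "SHA-256": hashlib.sha256().digest_size,  # 32 bytes
}

OFFSET_SIZE = 8  # each index entry also carries a 64-bit end offset


def index_entry_bytes(chunk_checksum: str, chunk_count: int) -> int:
    """Uncompressed bytes used by the per-chunk index entries.

    Assumes raw, untruncated digests (not stated in the spec).
    """
    return chunk_count * (DIGEST_SIZE[chunk_checksum] + OFFSET_SIZE)


chunks = 10_000  # hypothetical chunk count, for illustration only
saved = index_entry_bytes("SHA-256", chunks) - index_entry_bytes("SHA-1", chunks)
print(saved)  # 120000 bytes, measured before the index is compressed
```

The saving is 12 bytes per chunk under these assumptions.  The index
is compressed afterwards, so the on-disk difference may not be exactly
the same.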

+-+-+-+-+-+---------------+================+------------------+
|    ID   | Checksum type | Index checksum | Compression type |
+-+-+-+-+-+---------------+================+------------------+

+-+-+-+-+-+-+-+-+==================+=================+
|  Index size   | Compressed Index | Compressed Dict |
+-+-+-+-+-+-+-+-+==================+=================+

+===========+===========+
|   Chunk   |   Chunk   | ==> More chunks
+===========+===========+

ID
 '\0ZCK1', identifies the file as a zchunk version 1 file

Checksum type
 This is an 8-bit unsigned integer containing the type of checksum
 used to generate the index checksum and the total data checksum, but
 *not* the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Index checksum
 This is the checksum of everything from this point until the end of
 the index.  It includes the compression type, the index size, and the
 compressed index.

Compression type
 This is an 8-bit unsigned integer containing the type of compression
 used to compress the dict and the chunks.

 Current values:
   0 = Uncompressed
   2 = zstd

Index size
 This is a 64-bit LE unsigned integer containing the size of the
 compressed index.

Compressed Index
 This is the index, which is described in the next section.  The index
 is compressed without a custom dictionary.

Compressed Dict (optional)
 This is a custom dictionary used when compressing each chunk.
 Because each chunk is compressed completely separately from the
 others, the custom dictionary gives us much better overall
 compression.  The custom dictionary is compressed without a custom
 dictionary (for obvious reasons).

Chunk
 This is a chunk of data, compressed with the custom dictionary
 provided above.
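
To make the header layout above concrete, here is a minimal Python
sketch of how a reader might split the header and check the index
checksum before touching the index itself.  It is not the C
implementation.  It assumes the index checksum is stored as a raw,
full-length digest; the function name and the returned dict keys are
made up for illustration.

```python
import hashlib
import struct

MAGIC = b"\0ZCK1"
# Checksum type values from the spec, mapped to hashlib constructors.
HASHES = {0: hashlib.sha1, 1: hashlib.sha256}


def parse_header(buf: bytes) -> dict:
    """Split a zchunk header into fields and verify the index checksum."""
    if len(buf) < 6 or buf[:5] != MAGIC:
        raise ValueError("not a zchunk version 1 file")
    checksum_type = buf[5]
    if checksum_type not in HASHES:
        raise ValueError("unknown checksum type")
    new_hash = HASHES[checksum_type]
    digest_size = new_hash().digest_size  # assumption: raw digest on disk

    # The index checksum covers everything from here to the end of the index.
    covered_start = 6 + digest_size
    if len(buf) < covered_start + 9:
        raise ValueError("truncated header")
    index_checksum = buf[6:covered_start]
    compression_type = buf[covered_start]
    (index_size,) = struct.unpack_from("<Q", buf, covered_start + 1)

    index_start = covered_start + 9
    index_end = index_start + index_size
    if len(buf) < index_end:
        raise ValueError("truncated index")

    # Verify before anyone tries to decompress or parse the index.
    if new_hash(buf[covered_start:index_end]).digest() != index_checksum:
        raise ValueError("index checksum mismatch")

    return {
        "checksum_type": checksum_type,
        "compression_type": compression_type,
        "compressed_index": buf[index_start:index_end],
        "data_start": index_end,  # dict and chunk offsets count from here
    }
```

The returned data_start is the "end of the index" that the dict and
chunk offsets in the index are measured from.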

The index:

+---------------------+-+-+-+-+-+-+-+-+======================+
| Chunk checksum type |  Chunk count  | Checksum of all data |
+---------------------+-+-+-+-+-+-+-+-+======================+

+================+-+-+-+-+-+-+-+-+
| Dict checksum  |  End of dict  |
+================+-+-+-+-+-+-+-+-+

+================+-+-+-+-+-+-+-+-+
| Chunk checksum | End of chunk  |  ==> More
+================+-+-+-+-+-+-+-+-+

Chunk checksum type
 This is an 8-bit unsigned integer containing the type of checksum used
 to generate the chunk checksums.

 Current values:
   0 = SHA-1
   1 = SHA-256

Chunk count
 This is a 64-bit LE unsigned integer containing the number of chunks
 in the zchunk file.

Checksum of all data
 This is the checksum of everything after the index, including the
 compressed dict and all the compressed chunks.  This checksum is
 generated using the overall checksum type, *not* the chunk checksum
 type.

Dict checksum
 This is the checksum of the compressed dict, used to detect whether
 two dicts are identical.  If there is no dict, the checksum must be
 all zeros.

End of dict
 This is a 64-bit LE unsigned integer containing the offset of the end
 of the dict, measured from the end of the index.  This gives us the
 information we need to find and decompress the dict.  If there is no
 dict, this must be zero.

Chunk checksum
 This is the checksum of the compressed chunk, used to detect whether
 any two chunks are identical.

End of chunk
 This is a 64-bit LE unsigned integer containing the offset of the end
 of the chunk, measured from the end of the index.  This gives us the
 information we need to find and decompress each chunk.
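
The index layout above could be walked roughly like this.  This is a
Python sketch under several assumptions that the text does not spell
out: the chunk count is a 64-bit LE integer (the diagram shows eight
bytes), the dict checksum uses the chunk checksum type, digests are
stored raw, and each entry starts where the previous one ended.  The
function name is made up.

```python
import hashlib
import struct

# Checksum type values from the spec, mapped to hashlib constructors.
HASHES = {0: hashlib.sha1, 1: hashlib.sha256}


def parse_index(index: bytes, overall_type: int):
    """Parse a decompressed index into (all_data_checksum, entries).

    entries[0] describes the dict, the rest describe chunks.  Each entry
    is (checksum, start, end), with offsets counted from the end of the
    index.  overall_type is the checksum type from the file header.
    """
    chunk_type = index[0]
    if chunk_type not in HASHES or overall_type not in HASHES:
        raise ValueError("unknown checksum type")
    chunk_digest = HASHES[chunk_type]().digest_size
    overall_digest = HASHES[overall_type]().digest_size

    (chunk_count,) = struct.unpack_from("<Q", index, 1)  # assumed 64-bit LE
    pos = 9
    data_checksum = index[pos:pos + overall_digest]
    pos += overall_digest

    entries = []
    start = 0
    for _ in range(chunk_count + 1):  # +1 for the dict entry
        checksum = index[pos:pos + chunk_digest]
        pos += chunk_digest
        (end,) = struct.unpack_from("<Q", index, pos)
        pos += 8
        if end < start:
            raise ValueError("offsets must not go backwards")
        entries.append((checksum, start, end))
        start = end  # assumption: entries are stored back to back
    return data_checksum, entries
```

With no dict, the first entry comes out as an all-zero checksum with
start and end both zero, so the first chunk starts right at the end of
the index.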

The index is designed so that it can be extracted from the file on the
server and downloaded separately, which makes it possible to download
only the parts of the file that are needed.  It must then be
re-embedded when assembling the file, so the user only needs to keep
one file.
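
As a sketch of how a client might use a separately downloaded index to
decide what to fetch, here is a small Python example.  It is only an
illustration: the function name and the (checksum, start, end) tuples
are hypothetical, and how the ranges are actually requested from the
server is left out.

```python
def plan_download(local_entries, remote_entries):
    """Decide which byte ranges of the new file must be downloaded.

    Both arguments are lists of (checksum, start, end) tuples, with
    offsets counted from the end of the respective file's index.

    Returns (reuse, fetch):
      reuse: list of (remote_range, local_range) for chunks already held
      fetch: list of remote (start, end) ranges that must be downloaded
    """
    # Chunk checksums are over the compressed bytes, so equal checksums
    # are taken to mean the compressed chunks are identical.
    local_by_checksum = {c: (s, e) for c, s, e in local_entries if e > s}

    reuse, fetch = [], []
    for checksum, start, end in remote_entries:
        if start == end:
            continue  # empty entry, e.g. a file without a dict
        if checksum in local_by_checksum:
            reuse.append(((start, end), local_by_checksum[checksum]))
        else:
            fetch.append((start, end))
    return reuse, fetch


# Hypothetical example: only chunk "C" is new in the remote file.
old = [(b"D", 0, 4), (b"A", 4, 10), (b"B", 10, 15)]
new = [(b"D", 0, 4), (b"A", 4, 10), (b"C", 10, 18), (b"B", 18, 23)]
reuse, fetch = plan_download(old, new)
print(fetch)  # [(10, 18)]
```

The reused ranges can be copied out of the local file and the fetched
ranges downloaded, after which the whole thing can be checked against
the checksum of all data from the new index.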