Re: [PATCH v2 17/17] chunk-format: add technical docs

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 04 Feb 2021 16:15:12 -0800

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.
> +
> +A chunk-based file format begins with some header information custom to
> +that format. That header should include enough information to identify
> +the file type, format version, and number of chunks in the file. From this
> +information, that file can determine the start of the chunk-based region.
> +
> +The chunk-based region starts with a table of contents describing where
> +each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
> +where C is the number of chunks. Consider the following table:
> +
> +  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
> +  |--------------------|------------------------|
> +  | ID[0]              | OFFSET[0]              |
> +  | ...                | ...                    |
> +  | ID[C]              | OFFSET[C]              |
> +  | 0x0000             | OFFSET[C+1]            |
> +
> +Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
> +Each integer is stored in network-byte order.
> +
> +The chunk identifier `ID[i]` is a label for the data stored within this
> +file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
> +size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
> +and `OFFSET[i]`. This requires that the chunk data appears contiguously
> +in the same order as the table of contents.
> +
> +The final entry in the table of contents must be four zero bytes. This
> +confirms that the table of contents is ending and provides the offset for
> +the end of the chunk-based data.
> +
> +Note: The chunk-based format expects that the file contains _at least_ a
> +trailing hash after `OFFSET[C+1]`.

I think the above describes what I saw in the writing side of the
code quite clearly and very well.  In my review of [2/17] I somehow
misread OFFSET[C+1] as pointing elsewhere, but the code is clear
that it points at the end of the last chunk, and the above documents
it well.
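
To make that reading of the table concrete, here is a minimal sketch
in C of looking up one chunk in a table of contents laid out as
described above.  The names (find_chunk, read_be32, read_be64,
TOC_ROW_SIZE) are made up for illustration and are not taken from the
patch; "nr" counts the chunk rows and excludes the terminating row.

```c
#include <stddef.h>
#include <stdint.h>

#define TOC_ROW_SIZE 12 /* 4-byte chunk ID followed by 8-byte offset */

/* Read integers stored in network byte order (big-endian). */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

/*
 * Hypothetical helper: find chunk "id" in a table of contents that
 * holds "nr" chunk rows plus one terminating row.  On success, store
 * the chunk's start offset and size and return 0; return -1 if the
 * chunk is absent or the table looks malformed.
 */
static int find_chunk(const unsigned char *toc, size_t nr, uint32_t id,
		      uint64_t *start, uint64_t *size)
{
	size_t i;

	/* The terminating row must carry an all-zero chunk ID. */
	if (read_be32(toc + nr * TOC_ROW_SIZE) != 0)
		return -1;

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ROW_SIZE;
		uint64_t this_off = read_be64(row + 4);
		/* The next row's offset marks where this chunk ends. */
		uint64_t next_off = read_be64(row + TOC_ROW_SIZE + 4);

		if (next_off < this_off)
			return -1; /* chunks must appear in table order */
		if (read_be32(row) == id) {
			*start = this_off;
			*size = next_off - this_off;
			return 0;
		}
	}
	return -1;
}
```

The size of the last chunk comes from the offset in the terminating
row, which is why that row has to be present even though it names no
chunk.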

My comments on the need to document the reading side API, and on
what the read_chunk callback should be able to assume (namely, that
the whole thing stays in memory until the caller that decided to use
the chunkfile API decides to discard it), still stand, I would think.
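
As an illustration of why that assumption matters, consider a
callback that keeps a pointer into the chunk data instead of copying
it.  This is only a sketch: the struct and function names are
hypothetical, and the callback shape (start pointer, size, caller
data) is assumed here, not quoted from the patch.

```c
#include <stddef.h>

/* Hypothetical consumer state; not part of the patch. */
struct chunk_view {
	const unsigned char *data; /* borrowed from the file buffer */
	size_t size;
};

/*
 * Hypothetical callback: remember where the chunk lives without
 * copying it.  The stored pointer is only usable for as long as the
 * underlying file buffer stays in memory, so the reading side API
 * needs to promise exactly that.
 */
static int remember_chunk(const unsigned char *chunk_start,
			  size_t chunk_size, void *cb_data)
{
	struct chunk_view *view = cb_data;

	view->data = chunk_start;
	view->size = chunk_size;
	return 0;
}
```

If the API were free to release the buffer right after the callback
returns, every such callback would have to copy the chunk instead.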

Thanks.