Re: [PATCH v4 17/17] chunk-format: add technical docs

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 18 Feb 2021 13:47:51 -0800

"Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> +Chunk-based file formats
> +========================
> +
> +Some file formats in Git use a common concept of "chunks" to describe
> +sections of the file. This allows structured access to a large file by
> +scanning a small "table of contents" for the remaining data. This common
> +format is used by the `commit-graph` and `multi-pack-index` files. See
> +link:technical/pack-format.html[the `multi-pack-index` format] and
> +link:technical/commit-graph-format.html[the `commit-graph` format] for
> +how they use the chunks to describe structured data.

I've read the doc added here to the end; well written and easy to
understand.

I wonder how well reftable files fit into this scheme, if at all.  If
they don't, should the chunk file format API be updated to accommodate
them (or the other way around)?

> +Extract the data information for each chunk using `pair_chunk()` or
> +`read_chunk()`:
> +
> +* `pair_chunk()` assigns a given pointer with the location inside the
> +  memory-mapped file corresponding to that chunk's offset. If the chunk
> +  does not exist, then the pointer is not modified.

I think it is worth adding:

    The caller is expected to know where the returned chunk ends by
    some out-of-band means, as this function only gives the offset
    but not the size, unlike the read_chunk() function.

> +* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
> +  with the appropriate initial pointer and size information. The function
> +  is not called if the chunk does not exist. Use this method to read chunks
> +  if you need to perform immediate parsing or if you need to execute logic
> +  based on the size of the chunk.
> +
> +After calling these methods, call `free_chunkfile()` to clear the
> +`struct chunkfile` data. This will not close the memory-mapped region.
> +Callers are expected to own that data for the timeframe the pointers into
> +the region are needed.
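
To make the difference between the two accessors concrete, here is a
toy sketch.  It is not the chunk-format API from this series: every
toy_* name, the in-memory table of contents, and the return values
are made up for illustration.  It only mirrors the behaviour the
text describes: the pair-style call hands back a bare pointer and
leaves it untouched on a miss, while the read-style call passes both
the pointer and the size to a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the structs or functions from the patch. */
struct toy_chunk {
	uint32_t id;
	size_t offset;	/* from the start of the mapped region */
	size_t size;
};

struct toy_chunkfile {
	const unsigned char *base;	/* pretend mmap'd region, owned by the caller */
	const struct toy_chunk *toc;	/* pretend table of contents */
	size_t nr;
};

typedef int (*toy_chunk_read_fn)(const unsigned char *start,
				 size_t size, void *data);

static const struct toy_chunk *toy_find(const struct toy_chunkfile *cf,
					uint32_t id)
{
	for (size_t i = 0; i < cf->nr; i++)
		if (cf->toc[i].id == id)
			return &cf->toc[i];
	return NULL;
}

/* Pointer only: on a miss, *p is left exactly as the caller set it. */
int toy_pair_chunk(const struct toy_chunkfile *cf, uint32_t id,
		   const unsigned char **p)
{
	const struct toy_chunk *c = toy_find(cf, id);
	if (!c)
		return 1;	/* assumed "not found" value */
	*p = cf->base + c->offset;
	return 0;
}

/* Pointer and size: the callback is not invoked on a miss. */
int toy_read_chunk(const struct toy_chunkfile *cf, uint32_t id,
		   toy_chunk_read_fn fn, void *data)
{
	const struct toy_chunk *c = toy_find(cf, id);
	if (!c)
		return 1;	/* assumed "not found" value */
	return fn(cf->base + c->offset, c->size, data);
}

/* Example callback: reject chunks that are not whole 4-byte records. */
int toy_count_records(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	if (size % 4)
		return -1;
	*(size_t *)data = size / 4;
	return 0;
}
```

A caller of toy_pair_chunk() has to learn the chunk length from
somewhere else (a header field, a fixed record width, and so on),
whereas toy_count_records() can validate the length on the spot.
Neither call touches the lifetime of the mapped region, which stays
with the caller as the paragraph above says.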

Thanks.