On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
>
> I've read the doc added here to the end; well written and easy to
> understand.
>
> I wonder how/if well reftable files fit in the scheme, or if it
> doesn't, should the chunk file format API be updated to accomodate
> it (or the other way around)?

I'm not sure that reftable can work with this format, especially
with its design to do most updates as append-only (IIUC). And to
change the format to work with the chunk format would violate the
compatibility with the JGit version.

I would be interested if something like the packed-refs file could
use a minor update, but only if there is a realistic benefit to
using chunks over the current format.

The files that are on my radar for adopting a new file format
using the chunk-format API are:

* reachability bitmaps: using a similar approach to the
  commit-graph, we could avoid parsing the entire file before
  checking if a specific commit has a bitmap. (Requires a
  commit lookup chunk, a bitmap data chunk, and an offset chunk
  to connect them.)

* index v5: I'm trying to collect a bunch of information about
  how to update the index for better compression, and the
  chunk-based approach can provide some fixed-width columns
  that can vary in length depending on the required data
  (presenting the interesting behavior from v2 and v3, along
  with possible approaches previously presented as a potential
  v5). The paths could be presented as a chunk, giving the
  interesting options between v2/3 and v4 (prefix compression).
  I haven't even started the actual work here, but I've been
  thinking about it a lot. I'll have time next month to start
  prototyping.

Are there other interesting files that could use a new version
here? What other pain points are known to experts in the area?

>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> +  memory-mapped file corresponding to that chunk's offset. If the chunk
>> +  does not exist, then the pointer is not modified.
>
> I think it is worth adding:
>
>     The caller is expected to know where the returned chunk ends by
>     some out-of-band means, as this function only gives the offset
>     but not the size, unlike the read_chunk() function.

True. I suppose that could be more explicit, although it can be
gleaned from the omission of any size information.

Thanks,
-Stolee
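
To illustrate the pair_chunk()/read_chunk() distinction discussed above,
here is a minimal standalone sketch. It is not the chunk-format API from
the patch: every `toy_` name, both struct layouts, and the in-memory
table of contents are assumptions made up for this example. The only
ideas taken from the thread are that the "pair" style hands back a
pointer with no size, while the "read" style also reports the size
(derived here from the next table-of-contents offset).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory table-of-contents entry (not Git's on-disk layout). */
struct toy_toc_entry {
	uint32_t id;    /* chunk identifier; 0 marks the terminating entry */
	size_t offset;  /* chunk start, measured from the start of the file */
};

/* Hypothetical handle: the mapped file plus its table of contents. */
struct toy_chunkfile {
	const unsigned char *file_start;
	const struct toy_toc_entry *toc; /* terminated by an entry with id == 0 */
};

typedef int (*toy_chunk_read_fn)(const unsigned char *start, size_t size,
				 void *data);

/* Return the index of the entry with the given id, or -1 if absent. */
static int toy_find(const struct toy_chunkfile *cf, uint32_t id)
{
	for (int i = 0; cf->toc[i].id != 0; i++)
		if (cf->toc[i].id == id)
			return i;
	return -1;
}

/*
 * "Pair" style: point *p at the chunk's first byte. The size is not
 * reported, so the caller must know where the chunk ends by other means.
 * If the chunk does not exist, *p is left unmodified.
 */
static int toy_pair_chunk(const struct toy_chunkfile *cf, uint32_t id,
			  const unsigned char **p)
{
	int i = toy_find(cf, id);
	if (i < 0)
		return -1;
	*p = cf->file_start + cf->toc[i].offset;
	return 0;
}

/*
 * "Read" style: pass both the start and the size to a callback. The size
 * is the distance to the next entry's offset; the terminating entry
 * guarantees that toc[i + 1] exists for every real chunk.
 */
static int toy_read_chunk(const struct toy_chunkfile *cf, uint32_t id,
			  toy_chunk_read_fn fn, void *data)
{
	int i;

	assert(fn != NULL);
	i = toy_find(cf, id);
	if (i < 0)
		return -1;
	return fn(cf->file_start + cf->toc[i].offset,
		  cf->toc[i + 1].offset - cf->toc[i].offset, data);
}

/* Example callback: record the chunk size into a size_t supplied as data. */
static int toy_store_size(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

With a table of contents listing chunks at offsets 4 and 10 and a
terminating entry at offset 16, toy_pair_chunk() on the second chunk
yields only `file_start + 10`, whereas toy_read_chunk() on the first
chunk hands the callback a size of 6.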