Re: [PATCH v4 17/17] chunk-format: add technical docs

On 2/18/2021 4:47 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> +Chunk-based file formats
>> +========================
>> +
>> +Some file formats in Git use a common concept of "chunks" to describe
>> +sections of the file. This allows structured access to a large file by
>> +scanning a small "table of contents" for the remaining data. This common
>> +format is used by the `commit-graph` and `multi-pack-index` files. See
>> +link:technical/pack-format.html[the `multi-pack-index` format] and
>> +link:technical/commit-graph-format.html[the `commit-graph` format] for
>> +how they use the chunks to describe structured data.
> 
> I've read the doc added here to the end; well written and easy to
> understand.
> 
> I wonder how well (if at all) reftable files fit in the scheme, or,
> if they don't, whether the chunk file format API should be updated
> to accommodate them (or the other way around)?

I'm not sure that reftable can work with this format, especially with
its design to do most updates as append-only (IIUC). And to change the
format to work with the chunk format would violate the compatibility
with the JGit version. I would be interested if something like the
packed-refs file could use a minor update, but only if there is a
realistic benefit to using chunks over the current format.

The files that are on my radar for adopting a new file format using the
chunk-format API are:

* reachability bitmaps: using a similar approach to the commit-graph,
  we could avoid parsing the entire file before checking if a specific
  commit has a bitmap. (Requires a commit lookup chunk, a bitmap data
  chunk, and an offset chunk to connect them.)

* index v5: I'm trying to collect a bunch of information about how to
  update the index for better compression, and the chunk-based approach
  can provide some fixed-width columns that can vary in length depending
  on the required data (presenting the interesting behavior from v2 and v3,
  along with possible approaches previously presented as a potential v5).
  The paths could be presented as a chunk, giving the interesting options
  between v2/3 and v4 (prefix compression). I haven't even started the
  actual work here, but I've been thinking about it a lot. I'll have time
  next month to start prototyping.
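
To make the reachability bitmap idea in the first bullet more concrete, here
is a rough sketch of how the three chunks could be connected. Nothing here is
an existing Git format or API: the struct, the function name, the 20-byte hash
length, and the host-order `uint64_t` offsets are all assumptions made for
illustration (a real on-disk format would more likely store network-order
offsets).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_OID_LEN 20 /* assumed hash length, for illustration only */

/* Hypothetical view of the three chunks after they have been located. */
struct bitmap_chunks {
	const unsigned char *lookup; /* nr sorted OIDs, SKETCH_OID_LEN bytes each */
	const uint64_t *offsets;     /* nr offsets into the data chunk */
	const unsigned char *data;   /* start of the bitmap data chunk */
	size_t nr;                   /* number of commits that have a bitmap */
};

/*
 * Binary-search the lookup chunk for 'oid'. On a hit, use the matching
 * row of the offset chunk to find the bitmap inside the data chunk.
 * Returns NULL if the commit has no bitmap. No bitmap data is parsed
 * until a lookup succeeds.
 */
static const unsigned char *find_bitmap(const struct bitmap_chunks *c,
					const unsigned char *oid)
{
	size_t lo = 0, hi = c->nr;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = memcmp(oid, c->lookup + mid * SKETCH_OID_LEN,
				 SKETCH_OID_LEN);

		if (!cmp)
			return c->data + c->offsets[mid];
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

The point of the split is that the lookup and offset chunks are fixed-width
and can be searched directly in the mapped file, while the variable-length
bitmaps stay untouched until one is actually needed.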

Are there other interesting files that could use a new version here?
What other pain points are known to experts in the area?

>> +Extract the data information for each chunk using `pair_chunk()` or
>> +`read_chunk()`:
>> +
>> +* `pair_chunk()` assigns a given pointer with the location inside the
>> +  memory-mapped file corresponding to that chunk's offset. If the chunk
>> +  does not exist, then the pointer is not modified.
> 
> I think it is worth adding:
> 
>     The caller is expected to know where the returned chunk ends by
>     some out-of-band means, as this function only gives the offset
>     but not the size, unlike the read_chunk() function.

True. I suppose that could be more explicit, although it can be gleaned
from the omission of any size information.
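
For anyone following along, the difference can be sketched as below. This is
not the code from the patch: the struct and the `sketch_*` names are
hypothetical, and it assumes a table of contents that stores only
(ID, offset) pairs and ends with a terminating entry whose offset marks the
end of the chunk data, so a chunk's size has to be derived from the next
entry's offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory table-of-contents entry; not Git's struct. */
struct toc_entry {
	uint32_t id;   /* chunk identifier */
	size_t offset; /* start of the chunk, relative to the file start */
};

struct toc {
	const unsigned char *file_start; /* base of the mapped file */
	const struct toc_entry *entries; /* nr entries plus one terminator */
	size_t nr;
};

/* Pointer-only lookup: '*p' is left untouched if the chunk is missing. */
static int sketch_pair(const struct toc *t, uint32_t id,
		       const unsigned char **p)
{
	for (size_t i = 0; i < t->nr; i++) {
		if (t->entries[i].id == id) {
			*p = t->file_start + t->entries[i].offset;
			return 0;
		}
	}
	return 1;
}

typedef int (*sketch_read_fn)(const unsigned char *start, size_t size,
			      void *data);

/* Callback lookup: also passes the size, computed from the next offset. */
static int sketch_read(const struct toc *t, uint32_t id,
		       sketch_read_fn fn, void *data)
{
	for (size_t i = 0; i < t->nr; i++) {
		if (t->entries[i].id == id) {
			size_t size = t->entries[i + 1].offset -
				      t->entries[i].offset;
			return fn(t->file_start + t->entries[i].offset,
				  size, data);
		}
	}
	return 1;
}

/* Example callback: records the size it was handed. */
static int record_size(const unsigned char *start, size_t size, void *data)
{
	(void)start;
	*(size_t *)data = size;
	return 0;
}
```

A caller of the pointer-only variant gets no bound at all, so it has to know
the chunk length from elsewhere (for example, a count stored in another
chunk), which is the caveat the suggested sentence would spell out.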

Thanks,
-Stolee


