On 08.12.20 at 19:49, Taylor Blau wrote:
> So, I think that we should pursue that direction a little further
> before deciding whether or not this is worth continuing. My early
> experiments showed that it does add a little more code to the
> chunk-format.{c,h} files, but you get negative diffs in midx.c and
> commit-graph.c, which is more in line with what I would expect from this
> series.

OK.

> I do think that the "overhead" here is more tolerable than we might
> think; I'd rather have a well-documented "chunkfile" implementation
> written once and called twice, than two near-identical implementations
> of _writing_ the chunks / table of contents at each of the call sites.
> So, even if this does end up being a net-lines-added kind of diff, I'd
> still say that it's worth it.

Well, interfaces are hard, and having two similar-but-not-quite-equal
pieces of code instead of a central API implementation trying to serve
two callers can actually be better.

I'm not too familiar with the chunk producers and consumers, so I can
only offer some high-level observations.  And I don't have to use the
API, so go wild! ;)  I was just triggered by the appearance of two
working pieces of code being replaced by two slightly different pieces
of code plus a third one on top.

> With regards to the "YAGNI" comment... I do have thoughts about
> extending the reachability bitmap format to use chunks (of course, this
> would break compatibility with JGit, and it isn't something that I plan
> to do in the short-term, or even necessarily in the future).
>
> In any event, I'm sure that these two won't be the last
> chunk-based formats that we have in Git.

OK, so perhaps we can do better before this scheme is copied.  The write
side is complicated by the fact that the table of contents (TOC) is
written first, followed by the actual chunks.  That requires two passes
over the data.

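To make the two-pass point concrete, here is a minimal sketch of a
TOC-first writer.  It is not the chunk-format API from the series:
struct chunk, write_toc_first() and the store_be*() helpers are
hypothetical names, the file header is left out, and the layout (a
4-byte ID plus an 8-byte offset per entry, closed by a terminating
entry) is a simplification.  It writes into a caller-provided buffer
just to stay self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory chunk; not Git's chunk-format API. */
struct chunk {
	uint32_t id;
	const unsigned char *data;
	uint64_t size;
};

#define TOC_ENTRY_SIZE 12 /* 4-byte ID + 8-byte offset */

/* Local big-endian helpers, named so as not to imply Git's own. */
static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

/*
 * Write the TOC followed by all payloads into "out" and return the
 * number of bytes written.  The caller guarantees that "out" is large
 * enough.
 */
static size_t write_toc_first(unsigned char *out,
			      const struct chunk *chunks, size_t nr)
{
	/* One TOC entry per chunk plus a terminating entry. */
	uint64_t offset = (uint64_t)(nr + 1) * TOC_ENTRY_SIZE;
	unsigned char *p = out;
	size_t i;

	/* Pass 1: every size must be known to compute the offsets. */
	for (i = 0; i < nr; i++) {
		store_be32(p, chunks[i].id);
		store_be64(p + 4, offset);
		p += TOC_ENTRY_SIZE;
		offset += chunks[i].size;
	}

	/* Terminating entry: ID 0 and the end of the last chunk. */
	store_be32(p, 0);
	store_be64(p + 4, offset);
	p += TOC_ENTRY_SIZE;

	/* Pass 2: only now can the payloads go out. */
	for (i = 0; i < nr; i++) {
		memcpy(p, chunks[i].data, chunks[i].size);
		p += chunks[i].size;
	}

	assert((uint64_t)(p - out) == offset);
	return (size_t)(p - out);
}
```

The first loop cannot emit a single TOC entry without knowing the size
of every preceding chunk, so a caller has to compute all sizes before
any payload is written.
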
The ZIP format solved a similar issue by placing the TOC at the end,
which allows for one-pass streaming.  Another way to achieve that would
be to put the TOC in a separate file, like we do for .pack and .idx
files.  This way you could have a single write function for chunks, and
writers would just be a single sequence of calls for the different
types.

But seeing that the read side just loads all of the chunks anyway
(skipping unknown IDs), I wonder why we need a TOC at all.  That would
only be useful if callers were trying to read just some small subset of
the whole file.  A collection of chunks for easy dumping and loading
could be serialized by writing just a small header for each chunk
containing its type and size, followed by its payload.

René
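
PS: A rough sketch of the TOC-less variant, in case it helps the
discussion.  Everything here is an assumption made for illustration:
the 12-byte header (4-byte type, 8-byte payload size), the names
write_chunk() and find_chunk(), and the local store_be*()/load_be*()
helpers.  It works on a memory buffer rather than a file to stay short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_HEADER_SIZE 12 /* 4-byte type + 8-byte payload size */

/* Local big-endian helpers, named so as not to imply Git's own. */
static void store_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

static void store_be64(unsigned char *p, uint64_t v)
{
	store_be32(p, (uint32_t)(v >> 32));
	store_be32(p + 4, (uint32_t)v);
}

static uint32_t load_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint64_t load_be64(const unsigned char *p)
{
	return ((uint64_t)load_be32(p) << 32) | load_be32(p + 4);
}

/*
 * Append one chunk at "p" and return a pointer just past it.  This is
 * the only write function needed; writers call it once per chunk.
 */
static unsigned char *write_chunk(unsigned char *p, uint32_t type,
				  const void *data, uint64_t size)
{
	assert(data || !size);
	store_be32(p, type);
	store_be64(p + 4, size);
	if (size)
		memcpy(p + CHUNK_HEADER_SIZE, data, size);
	return p + CHUNK_HEADER_SIZE + size;
}

/*
 * Walk the stream and return the payload of the first chunk of type
 * "want", skipping all other (including unknown) types.  Returns NULL
 * if there is no such chunk or the stream is truncated.
 */
static const unsigned char *find_chunk(const unsigned char *buf, size_t len,
				       uint32_t want, uint64_t *size)
{
	size_t pos = 0;

	while (len - pos >= CHUNK_HEADER_SIZE) {
		uint32_t type = load_be32(buf + pos);
		uint64_t sz = load_be64(buf + pos + 4);

		pos += CHUNK_HEADER_SIZE;
		if (sz > len - pos)
			return NULL; /* payload runs past the end */
		if (type == want) {
			*size = sz;
			return buf + pos;
		}
		pos += (size_t)sz;
	}
	return NULL;
}
```

The writer needs no knowledge of the other chunks, so there is only one
pass.  The price is that a reader looking for a single chunk has to hop
from header to header instead of consulting a table.
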