On Mon, May 06, 2024 at 12:37:35PM -0400, Taylor Blau wrote: > On Mon, May 06, 2024 at 01:52:44PM +0200, Patrick Steinhardt wrote: > > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt > > > index f5d200939b0..63a7177ac08 100644 > > > --- a/Documentation/technical/bitmap-format.txt > > > +++ b/Documentation/technical/bitmap-format.txt > > > @@ -255,3 +255,182 @@ triplet is - > > > xor_row (4 byte integer, network byte order): :: > > > The position of the triplet whose bitmap is used to compress > > > this one, or `0xffffffff` if no such bitmap exists. > > > + > > > +Pseudo-merge bitmaps > > > +-------------------- > > > + > > > +If the `BITMAP_OPT_PSEUDO_MERGES` flag is set, a variable number of > > > +bytes (preceding the name-hash cache, commit lookup table, and trailing > > > +checksum) of the `.bitmap` file is used to store pseudo-merge bitmaps. > > > > Here you say that the section is supposed to come before some other > > sections, whereas the first sentence in the "File format" section says > > that it is the last section in a bitmap file. > > This is a quirk of the on-disk .bitmap format. New sections are added > before existing sections, so if you were reading the file from beginning > to end, you'd see the pseudo-merges extension, then the lookup table, > then the name-hash cache (assuming all were written). > > I think that describing them in the order they were introduced here > makes more sense, leaving their layout within the .bitmap file as an > implementation detail. > > If you feel strongly otherwise, let's clean it up outside of this series > since this whole portion of the documentation would need to be > reordered. I don't, thanks for the explanation. [snip] > +=== Overview > + > A "pseudo-merge bitmap" is used to refer to a pair of bitmaps, as > follows: > --- >8 --- > > > > +For example, suppose there exists a pseudo-merge bitmap with a large > > > +number of commits, all of which are listed in the `WANTS` section of > > > +some bitmap traversal query. When pseudo-merge bitmaps are enabled, the > > > +bitmap machinery can quickly determine there is a pseudo-merge which > > > +satisfies some subset of the wanted objects on either side of the query. > > > +Then, we can inflate the EWAH-compressed bitmap, and `OR` it in to the > > > +resulting bitmap. By contrast, without pseudo-merge bitmaps, we would > > > +have to repeat the decompression and `OR`-ing step over a potentially > > > +large number of individual bitmaps, which can take proportionally more > > > +time. > > > + > > > +Another benefit of pseudo-merges arises when there is some combination > > > +of (a) a large number of references, with (b) poor bitmap coverage, and > > > +(c) deep, nested trees, making fill-in traversal relatively expensive. > > > +For example, suppose that there are a large enough number of tags where > > > +bitmapping each of the tags individually is infeasible. Without > > > +pseudo-merge bitmaps, computing the result of, say, `git rev-list > > > +--use-bitmap-index --count --objects --tags` would likely require a > > > +large amount of fill-in traversal. But when a large quantity of those > > > +tags are stored together in a pseudo-merge bitmap, the bitmap machinery > > > +can take advantage of the fact that we only care about the union of > > > +objects reachable from all of those tags, and answer the query much > > > +faster. > > > > I would start the explanation with a discussion of the problem before > > presenting the solution to those problems. In the current version it's > > the other way round, you present a solution to a problem that isn't yet > > explained > > > > It might also be helpful to discuss a bit who is supposed to create > > those pseudo-merge bitmaps. Does Git do so automatically for all tags? > > Does the admin have to configure this? If the latter, when do you want > > to create those and what strategies are there to create them? > > The pseudo-merge bitmaps are created by Git itself, configured via the > options described later on in this series. I'm happy to add a specific > call-out, but I would rather do it elsewhere outside of > Documentation/technical/bitmap-format.txt, which I think should be > mostly focused on the on-disk format. I think what throws me off here is that you already go into the non-technical somewhat by explaining their usecases. This causes us to end up halfwhere between "We motivate the changes" and "We document the technical parts, only". [snip] > > In case you have multiple pseudo-merge bitmaps, is the whole of the > > above repeated for each bitmap or is it just parts of it? > > The "pseudo-merge bitmaps" section contains a variable number of pairs > of EWAH bitmaps, one pair for each pseudo-merge bitmap. I think this is > covered below where it says "one or more pseudo-merge bitmaps, each > containing: [...]", but let me know if I should be more explicit. > > > > +* An (optional) extended lookup table (written if and only if there is > > > + at least one commit which appears in more than one pseudo-merge). > > > + There are as many entries as commits which appear in multiple > > > + pseudo-merges. Each entry contains the following: > > > + > > > + ** `N`, a 4-byte unsigned value equal to the number of pseudo-merges > > > + which contain a given commit. > > > > How exactly is the given commit identified? Or in other words, given an > > entry in the lookup table here, how do I figure out what commit it > > belongs to? > > They aren't identified within this section. The extended lookup table is > indexed into via the lookup table with an offset that is stored in the > `offset` field when the MSB is set. Okay. Would this explanation be a good addition to the document? Patrick
Attachment:
signature.asc
Description: PGP signature