Re: [PATCH v2 01/23] Documentation/technical: describe pseudo-merge bitmaps format

Patrick Steinhardt <ps@xxxxxx> · Fri, 10 May 2024 13:46:59 +0200

On Mon, May 06, 2024 at 12:37:35PM -0400, Taylor Blau wrote:
> On Mon, May 06, 2024 at 01:52:44PM +0200, Patrick Steinhardt wrote:
> > > diff --git a/Documentation/technical/bitmap-format.txt b/Documentation/technical/bitmap-format.txt
> > > index f5d200939b0..63a7177ac08 100644
> > > --- a/Documentation/technical/bitmap-format.txt
> > > +++ b/Documentation/technical/bitmap-format.txt
> > > @@ -255,3 +255,182 @@ triplet is -
> > >  	xor_row (4 byte integer, network byte order): ::
> > >  	The position of the triplet whose bitmap is used to compress
> > >  	this one, or `0xffffffff` if no such bitmap exists.
> > > +
> > > +Pseudo-merge bitmaps
> > > +--------------------
> > > +
> > > +If the `BITMAP_OPT_PSEUDO_MERGES` flag is set, a variable number of
> > > +bytes (preceding the name-hash cache, commit lookup table, and trailing
> > > +checksum) of the `.bitmap` file is used to store pseudo-merge bitmaps.
> >
> > Here you say that the section is supposed to come before some other
> > sections, whereas the first sentence in the "File format" section says
> > that it is the last section in a bitmap file.
> 
> This is a quirk of the on-disk .bitmap format. New sections are added
> before existing sections, so if you were reading the file from beginning
> to end, you'd see the pseudo-merges extension, then the lookup table,
> then the name-hash cache (assuming all were written).
> 
> I think that describing them in the order they were introduced here
> makes more sense, leaving their layout within the .bitmap file as an
> implementation detail.
> 
> If you feel strongly otherwise, let's clean it up outside of this series
> since this whole portion of the documentation would need to be
> reordered.

I don't, thanks for the explanation.

[snip]
> +=== Overview
> +
>  A "pseudo-merge bitmap" is used to refer to a pair of bitmaps, as
>  follows:
> --- >8 ---
> 
> > > +For example, suppose there exists a pseudo-merge bitmap with a large
> > > +number of commits, all of which are listed in the `WANTS` section of
> > > +some bitmap traversal query. When pseudo-merge bitmaps are enabled, the
> > > +bitmap machinery can quickly determine there is a pseudo-merge which
> > > +satisfies some subset of the wanted objects on either side of the query.
> > > +Then, we can inflate the EWAH-compressed bitmap, and `OR` it in to the
> > > +resulting bitmap. By contrast, without pseudo-merge bitmaps, we would
> > > +have to repeat the decompression and `OR`-ing step over a potentially
> > > +large number of individual bitmaps, which can take proportionally more
> > > +time.
> > > +
> > > +Another benefit of pseudo-merges arises when there is some combination
> > > +of (a) a large number of references, with (b) poor bitmap coverage, and
> > > +(c) deep, nested trees, making fill-in traversal relatively expensive.
> > > +For example, suppose that there are a large enough number of tags where
> > > +bitmapping each of the tags individually is infeasible. Without
> > > +pseudo-merge bitmaps, computing the result of, say, `git rev-list
> > > +--use-bitmap-index --count --objects --tags` would likely require a
> > > +large amount of fill-in traversal. But when a large quantity of those
> > > +tags are stored together in a pseudo-merge bitmap, the bitmap machinery
> > > +can take advantage of the fact that we only care about the union of
> > > +objects reachable from all of those tags, and answer the query much
> > > +faster.
> >
> > I would start the explanation with a discussion of the problem before
> > presenting the solution to those problems. In the current version it's
> > the other way round, you present a solution to a problem that isn't yet
> > explained
> >
> > It might also be helpful to discuss a bit who is supposed to create
> > those pseudo-merge bitmaps. Does Git do so automatically for all tags?
> > Does the admin have to configure this? If the latter, when do you want
> > to create those and what strategies are there to create them?
> 
> The pseudo-merge bitmaps are created by Git itself, configured via the
> options described later on in this series. I'm happy to add a specific
> call-out, but I would rather do it elsewhere outside of
> Documentation/technical/bitmap-format.txt, which I think should be
> mostly focused on the on-disk format.

I think what throws me off here is that you already go into the
non-technical somewhat by explaining their usecases. This causes us to
end up halfwhere between "We motivate the changes" and "We document the
technical parts, only".

[snip]
> > In case you have multiple pseudo-merge bitmaps, is the whole of the
> > above repeated for each bitmap or is it just parts of it?
> 
> The "pseudo-merge bitmaps" section contains a variable number of pairs
> of EWAH bitmaps, one pair for each pseudo-merge bitmap. I think this is
> covered below where it says "one or more pseudo-merge bitmaps, each
> containing: [...]", but let me know if I should be more explicit.
> 
> > > +* An (optional) extended lookup table (written if and only if there is
> > > +  at least one commit which appears in more than one pseudo-merge).
> > > +  There are as many entries as commits which appear in multiple
> > > +  pseudo-merges. Each entry contains the following:
> > > +
> > > +  ** `N`, a 4-byte unsigned value equal to the number of pseudo-merges
> > > +     which contain a given commit.
> >
> > How exactly is the given commit identified? Or in other words, given an
> > entry in the lookup table here, how do I figure out what commit it
> > belongs to?
> 
> They aren't identified within this section. The extended lookup table is
> indexed into via the lookup table with an offset that is stored in the
> `offset` field when the MSB is set.

Okay. Would this explanation be a good addition to the document?

Patrick
Attachment:
signature.asc

Description: PGP signature