Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension

Abhradeep Chakraborty <chakrabortyabhradeep79@xxxxxxxxx> · Tue, 21 Jun 2022 17:22:12 +0530

Taylor Blau <me@xxxxxxxxxxxx> wrote:

> What is the purpose of the GIT_READ_COMMIT_TABLE environment variable? I
> assume that it's to make it easier to run tests (especially performance
> ones) with and without access to the lookup table. If so, we should
> document that (lightly) in the commit message, and rename this to be
> GIT_TEST_READ_COMMIT_TABLE to indicate that it shouldn't be used outside
> of tests.

This is mainly for testing, GIT_TEST_READ_COMMIT_TABLE is perfect.

> All makes sense. Some light documentation might help explain what this
> comparator function is used for (the bsearch() call below in
> bitmap_table_lookup()), although I suspect that this function will get
> slightly more complicated if you pack the table contents as I suggest,
> in which case more documentation will definitely help.

Ok.

> Interesting; this is a point that I forgot about from the original
> patch. xor_pos is an index (not an offset) into the list of commits in
> the table of contents in the order appear in that table. We should be
> clear about (a) what that order is, and (b) that xor_pos is an index
> into that order.

This is exactly what I said in my first reply. I made a mistake here.
(1) As xor_pos is relative to the current bitmap, it depends on the bitmap
entry order. These two order are not same. One is ordered by date, another
is lexicographically ordered. I will fix it.

> Yeah. The problem here is that we can't record every commit that
> _doesn't_ have a bitmap every time we return NULL from one of these
> queries, since there are arbitrarily many such commits that don't have
> bitmaps.
>
> We could approximate it using a Bloom filter or something, and much of
> that code is already written and could be interesting to try and reuse.
> But I wonder if we could get by with something simpler, though, which
> would cause us to load all bitmaps from the lookup table after a fixed
> number of cache misses (at which point we should force ourselves to load
> everything and just read everything out of a single O(1) lookup in the
> stored bitmap table).
>
> That may or may not be a good idea, and the threshold will probably be
> highly dependent on the system. So it may not even be worth it, but I
> think it's an interesting area to experiemnt in and think a little more
> about.

Now I got the point. I wonder what if we leave it as it is. How much will
it affect the code?

> How does this commit_pos work again? I confess I have forgetten since I
> wrote some of this code a while ago... :-).

It is using recursive strategy. The first call to `stored_bitmap_for_commit`
function do not have `pos_hint`. So, it uses `bitmap_table_lookup` to find
the commit position in the list and makes a call to `lazy_bitmap_for_commit`
function. This function gets the offset and xor-offset using the commit id's
position in the list. If xor-offset exists, it is using this xor-offset to
get the xor-bitmap by calling `stored_bitmap_for_commit` again. But this time
`pos_hint` is xor-offset. This goes on till the last non-xor bitmap has found.

As I said before, xor-offset should be an absolute value to make it work
correctly.

> Should we print this regardless of whether or not there is a lookup
> table? We should be able to learn the entry count either way.

No, this is necessary. "Bitmap v1 test (%d entries loaded)" means
all the bitmap entries has been loaded. It is basically for 
`load_bitmap_entries_bitmap_v1` function which loads all the bitmaps
One by one. But if there is a lookup table, `prepare_bitmap_git`
function will not load every entries and thus printing the above
line is wrong.

Thanks :)