Re: [PATCH 2/6] pack-bitmap: prepare to read lookup table extension

Taylor Blau <me@xxxxxxxxxxxx> · Wed, 22 Jun 2022 12:49:43 -0400

On Tue, Jun 21, 2022 at 05:22:12PM +0530, Abhradeep Chakraborty wrote:
> Taylor Blau <me@xxxxxxxxxxxx> wrote:
> > Yeah. The problem here is that we can't record every commit that
> > _doesn't_ have a bitmap every time we return NULL from one of these
> > queries, since there are arbitrarily many such commits that don't have
> > bitmaps.
> >
> > We could approximate it using a Bloom filter or something, and much of
> > that code is already written and could be interesting to try and reuse.
> > But I wonder if we could get by with something simpler, though, which
> > would cause us to load all bitmaps from the lookup table after a fixed
> > number of cache misses (at which point we should force ourselves to load
> > everything and just read everything out of a single O(1) lookup in the
> > stored bitmap table).
> >
> > That may or may not be a good idea, and the threshold will probably be
> > highly dependent on the system. So it may not even be worth it, but I
> > think it's an interesting area to experiemnt in and think a little more
> > about.
>
> Now I got the point. I wonder what if we leave it as it is. How much will
> it affect the code?

I'm not sure, and I think that it depends a lot on the repository and
query that we're running.

I'd imagine that the effect is probably measurable, but small. Each hash
lookup is cheap, but if there are many such lookups (a large proportion
of which end up resulting in "no, we haven't loaded this bitmap yet" and
then "...because no such bitmap exists for that commit") at some point
it is worth it to fault all of the commits that _do_ have bitmaps in and
answer authoritatively.

In other words, right now we have to do two queries when an commit
doesn't have a bitmap stored:

  - first, a lookup to see whether we have already loaded a bitmap for
    that commit

  - then, a subsequent lookup to see whether the .bitmap file itself has
    a bitmap for that commit, but we just haven't loaded it yet

If we knew that we had loaded all of the bitmaps in the file, then we
could simplify the above two queries into one, since whatever the first
one returns is enough to know whether or not a bitmap exists at all.

> > How does this commit_pos work again? I confess I have forgetten since I
> > wrote some of this code a while ago... :-).
>
> It is using recursive strategy. The first call to `stored_bitmap_for_commit`
> function do not have `pos_hint`. So, it uses `bitmap_table_lookup` to find
> the commit position in the list and makes a call to `lazy_bitmap_for_commit`
> function. This function gets the offset and xor-offset using the commit id's
> position in the list. If xor-offset exists, it is using this xor-offset to
> get the xor-bitmap by calling `stored_bitmap_for_commit` again. But this time
> `pos_hint` is xor-offset. This goes on till the last non-xor bitmap has found.

Ahhh. Thanks for refreshing my memory. I wonder if you think there is a
convenient way to work some of this into a short comment to help other
readers in the future, too.

> As I said before, xor-offset should be an absolute value to make it work
> correctly.

Yep, makes sense.

> > Should we print this regardless of whether or not there is a lookup
> > table? We should be able to learn the entry count either way.
>
> No, this is necessary. "Bitmap v1 test (%d entries loaded)" means
> all the bitmap entries has been loaded. It is basically for
> `load_bitmap_entries_bitmap_v1` function which loads all the bitmaps
> One by one. But if there is a lookup table, `prepare_bitmap_git`
> function will not load every entries and thus printing the above
> line is wrong.

Right, that part makes sense to me. But I wonder if we should still
print something, perhaps just "Bitmap v1 test" or "Bitmap v1 test (%d
entries)" omitting the "loaded" part.

Thanks,
Taylor