Re: [GSoC][RFC][PROPOSAL] Reachability bitmap improvements

Taylor Blau <me@xxxxxxxxxxxx> wrote:

> Thanks so much for submitting this proposal! I have been excited to look
> through it, and am sorry for not getting to it sooner. Let me take a
> look now and give some of my thoughts.

Thanks so much, Taylor, for reviewing my proposal. It's my pleasure!

> Great! It sounds like working on Git would give you some new experience
> in a different domain than some of your past projects, and help you
> learn more about how it works and is developed.

Yeah, definitely! Looking forward to exploring more in this domain :)

> Terrific! I am really glad that the MyFirst... documents were helpful
> and made it easier for you to contribute. The ProGit book is a great
> resource, too.

Thanks :)

> If you are looking for more resources, I would encourage you to search
> around for blog posts written by Git contributors, particularly related
> to reachability bitmaps (at least for this GSoC project). Some helpful
> places to start there would be:
>
>     https://github.blog/2015-09-22-counting-objects/
>     https://github.blog/2021-04-29-scaling-monorepo-maintenance/

I have already read the "counting objects" article. It is a very good
article for understanding reachability bitmaps in Git. I will surely
go through the other article you shared.

> Note that commits and trees can be stored as deltas against other
> commits and trees (which themselves might be packed as deltas) and so
> on. But we only consider a pair of objects to be delta candidates for
> one another when they share a common type (e.g., we would never delta a
> tree object against a blob or vice-versa).

Oops, that is new to me. I didn't know that commits and trees could
also be stored as deltas. Thanks for this info!
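
As a concrete illustration of that rule, here is a minimal sketch (the
"object_entry" type is hypothetical, not Git's actual code in
builtin/pack-objects.c, which also applies size and depth heuristics):

    #include <stdbool.h>

    enum object_type { OBJ_COMMIT, OBJ_TREE, OBJ_BLOB, OBJ_TAG };

    struct object_entry {
            enum object_type type;
            unsigned long size;
    };

    /*
     * Two objects are delta candidates for one another only when
     * they share a common type: we would never delta a tree object
     * against a blob, or vice-versa.
     */
    static bool delta_candidates(const struct object_entry *a,
                                 const struct object_entry *b)
    {
            return a->type == b->type;
    }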

> Exactly; though note that since that series we now have multi-pack
> bitmaps, too, which use a slightly different ordering called the
> "pseudo-pack" order, which (more or less) looks like all of the objects
> in a MIDX first sorted by which pack they came from, and then sorted by
> their order within the pack, removing duplicates in favor of the
> "preferred" pack.
>
> More thorough documentation on this can be found in the
> "multi-pack reverse indexes" section of
> "Documentation/technical/pack-format.txt".

Thanks. Will look into it :)
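
Just to check my understanding, the pseudo-pack order would look
roughly like this comparison function (hypothetical types, not Git's
MIDX code; I am assuming the "preferred" pack gets the lowest pack id
and that duplicates have already been dropped in its favor):

    #include <stdint.h>
    #include <stdlib.h>

    struct pseudo_pack_entry {
            uint32_t pack_id; /* position of the source pack in the MIDX */
            uint64_t offset;  /* the object's offset within that pack */
    };

    /* Sort by source pack first, then by offset within that pack. */
    static int pseudo_pack_cmp(const void *va, const void *vb)
    {
            const struct pseudo_pack_entry *a = va;
            const struct pseudo_pack_entry *b = vb;

            if (a->pack_id != b->pack_id)
                    return a->pack_id < b->pack_id ? -1 : 1;
            if (a->offset != b->offset)
                    return a->offset < b->offset ? -1 : 1;
            return 0;
    }

    /* e.g. qsort(entries, nr, sizeof(*entries), pseudo_pack_cmp); */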

> But I want to be careful about the word "slow" here. They might be slow,
> or it might be really fast to decompress a given bitmap, depending on
> the disk performance, CPU pressure, how large the bitmaps are, and so
> on. So I think a good first step here is to validate our assumption that
> EWAH decompression is even slow to begin with.
>
> (Of course, the literature on newer formats will readily tell you that
> the old stuff is slower, but I think it's worth reevaluating those
> statements in the context of a practical application instead of in
> theory / benchmarks).

I think it is a conceptually/theoretically slow approach, because as
you said -
> we can't begin to interpret the bitmaps until after decompressing them

It has to decompress the whole bitmap instead of evaluating just the
target bits. Hardware can make it fast in practice, but the asymptotic
cost is still there.

I will validate whether it is slow in practice, though.
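
To illustrate what I mean by the asymptotic cost: with a run-length
encoded bitmap like EWAH, even testing a single bit means scanning the
compressed stream from the start. A toy sketch (the format below is
made up for illustration, not the exact EWAH encoding):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct rle_word {
            uint64_t count; /* how many 64-bit words this entry stands for */
            uint64_t value; /* the repeated word */
    };

    /*
     * O(number of compressed entries): there is no way to jump to
     * the entry covering `pos` without walking everything before it.
     */
    static bool test_bit(const struct rle_word *stream, size_t n,
                         uint64_t pos)
    {
            uint64_t word_idx = pos / 64;

            for (size_t i = 0; i < n; i++) {
                    if (word_idx < stream[i].count)
                            return (stream[i].value >> (pos % 64)) & 1;
                    word_idx -= stream[i].count;
            }
            return false; /* past the end: implicitly zero */
    }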

> Same comments here about whether or not this is slow to begin with. From
> my experimental patches in this area earlier, I found that we could get
> a significant speed-up in some cases, but that the speed-up was
> basically obliterated whenever we had to do any follow-on traversal if
> the bitmap coverage wasn't completely up-to-date.

True. But at least it is (hopefully) faster when the bitmap coverage is
up-to-date. We may be able to optimize it further to make it fast overall.

> Arguably we could do this whether or not we solve the above, but doing
> some combination of the above may make it easier (e.g., if we wanted to
> change the bitmap selection to store dramatically more commits, then we
> may want to investigate how much of a bottleneck the sequential read
> requirement of .bitmap files would become for different numbers of
> selected commits).

True.
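
(To make sure I understand the sequential read requirement: as I read
Documentation/technical/bitmap-format.txt, each entry is a small fixed
header followed by a variable-length EWAH bitmap, with no offset table,
so reaching the k-th selected commit means parsing past every entry
before it. Roughly, glossing over byte order and with `f` positioned at
the first entry:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Skip one serialized EWAH bitmap: a 32-bit bit count, a 32-bit
     * word count, that many 64-bit words, and a 32-bit RLW position.
     * (On disk these are big-endian; ignored for this sketch.)
     */
    static int skip_ewah(FILE *f)
    {
            uint32_t bits, words;

            if (fread(&bits, 4, 1, f) != 1 ||
                fread(&words, 4, 1, f) != 1)
                    return -1;
            return fseek(f, (long)words * 8 + 4, SEEK_CUR);
    }

    /* Find the offset of the k-th entry by walking all earlier ones. */
    static long find_kth_entry(FILE *f, uint32_t k)
    {
            for (uint32_t i = 0; i < k; i++) {
                    uint32_t commit_pos;
                    uint8_t xor_offset, flags;

                    if (fread(&commit_pos, 4, 1, f) != 1 ||
                        fread(&xor_offset, 1, 1, f) != 1 ||
                        fread(&flags, 1, 1, f) != 1 ||
                        skip_ewah(f) < 0)
                            return -1;
            }
            return ftell(f);
    }

If the number of selected commits grows a lot, that linear scan does
seem like it could become a bottleneck.)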

> For bitmaps, the number one thing we care about is correctness. I have
> never thought about using Bloom filters before; even though the
> one-sided nature of their errors means that we would never forget to
> send an object that we should have, having an absolute result is vastly
> preferred.
>
> After that, I think we mostly care about how quickly they can be
> decompressed. And after that, I think we care about things like "how
> fast can they be written", and "how large are they".

That makes sense. I had included those probabilistic approaches because
I thought we could check whether the obtained result is right or wrong.
In the case of wrong answers (a 2% chance for HyperLogLog), we could
fall back to another approach that corrects only the wrong results. But
I think that would become complex; more research is needed to check
whether it can be done at all.
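
(On the one-sided nature of Bloom filter errors, just to show what I
meant: a membership test can return a false "maybe present", but never
a false "absent". A toy sketch, not Git's changed-path Bloom filter
code:

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOOM_BITS 1024
    #define BLOOM_HASHES 3

    struct bloom {
            uint8_t bits[BLOOM_BITS / 8]; /* zero-initialize before use */
    };

    /* Seeded FNV-1a; good enough for a sketch. */
    static uint32_t bloom_hash(const char *key, uint32_t seed)
    {
            uint32_t h = 2166136261u ^ seed;

            while (*key) {
                    h ^= (uint8_t)*key++;
                    h *= 16777619u;
            }
            return h % BLOOM_BITS;
    }

    static void bloom_add(struct bloom *b, const char *key)
    {
            for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
                    uint32_t i = bloom_hash(key, s);
                    b->bits[i / 8] |= 1u << (i % 8);
            }
    }

    static bool bloom_test(const struct bloom *b, const char *key)
    {
            for (uint32_t s = 0; s < BLOOM_HASHES; s++) {
                    uint32_t i = bloom_hash(key, s);
                    if (!(b->bits[i / 8] & (1u << (i % 8))))
                            return false; /* definitely absent */
            }
            return true; /* present, or a false positive */
    }

So a false positive could only make us send extra objects, never drop
one, but I agree an absolute result is what we want here.)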

As for the other approaches (i.e. Roaring+Run and sroar), I think these
two fit the criteria you listed in your comment, though I need to work
more to fully understand the sroar methodology.

> This all sounds very good and ambitious. Keep in mind, though, that
> these projects have a tendency to take much more time than expected ;-).
> If we get one done, I'll be thrilled! (The goal with suggesting a few
> potential directions to go in is not to say: "let's do all of these
> things", but rather to give you some options in terms of what you find
> most interesting and exciting to work on).
>
> So I think it makes sense to try and find a way you can dip your toes
> into 2-3 of the sub-projects and get a sense for what feels most doable
> and interesting, then focus on that before allocating time for more
> projects in the future.
>
> > ## Estimated Timeline
>
> Like I said, I'm fine if you spend your entire GSoC project working on
> just one of these projects. So I don't want to get hung up on specific
> timelines just yet. If you end up working on this, I would suggest
> budgeting a week or so to try out a couple of different sub-projects,
> and then decide where to spend your time from there.

If that is the case, then I would go for the first sub-project, i.e.
"think of a better alternative to EWAH". It interests me a lot. Though
I will do what you suggested, i.e. "try and find a way you can dip your
toes into 2-3 of the sub-projects and get a sense for what feels most
doable and interesting".

Should I remove the "Estimated Timeline" section (and/or points 2 and 3
from "The Plan" section) for now, then?

> Thanks very much for your interest and the time you spent putting this
> proposal together. I hope that some of my comments were helpful in
> refining it, and that you didn't mind my slow response too much.

The comments and suggestions are indeed helpful! And I didn't mind the
wait at all.

Thanks :)


