On 4/24/2023 8:00 PM, Taylor Blau wrote: > When reachability bitmap coverage exists in a repository, Git will use a > different (and hopefully faster) traversal to compute revision walks. Before getting into the code... > Consider a set of positive and negative tips (which we'll refer to with > their standard bitmap parlance by, "wants", and "haves"). In order to > figure out what objects exist between the tips, the existing traversal > in prepare_bitmap_walk() looks something like: > 4. Construct a reachability bitmap for the "haves" side by walking > from the revision tips down to any existing bitmaps, OR-ing in any > bitmaps as they are found. > > 5. Then do the same for the "wants" side, stopping at any objects that > appear in the "haves" bitmap. One important thing about this older algorithm is that it will avoid including trees and blobs that are reachable from the "haves" that are not reachable from the boundary between "haves" and "wants". The most obvious case is that someone force-pushed a branch after rewording the commit messages but keeping the trees and blobs identical. This case is rare, and usually the objects that would be repeated are low enough in count and size (with deltas) that it's not a big deal, but it is worth bringing up. I didn't see a reference to this side of the difference in your message. > One of the benefits, however, is that even if the walk is slower, bitmap > traversals are guaranteed to provide an *exact* answer. Unlike the > traditional object traversal algorithm, which can over-count the results > by not opening trees for older commits, the bitmap walk builds an exact > reachability bitmap for either side, meaning the results are never > over-counted. So the new algorithm is a new, third state of "not overcounting things reachable beyond the commit-walk boundary" and "overcounting things reachable from the wants..haves commit region". > But producing non-exact results is OK for our traversal here (both in > the bitmap case and not), as long as the results are over-counted, not > under. True, our response will not create a pack-file that cannot be applied locally, but it might be larger than required. For these reasons, I'm surprised that this patch completely replaces the old algorithm for the new one. I would prefer to see a config option that enables this new algorithm. It would be safer to deploy that way, too. > The new result performs significantly better in many cases, particularly > when the distance from the boundary commit(s) to an existing bitmap is > shorter than the distance from (all of) the have tips to the nearest > bitmapped commit. Aside: we may even gain benefits by adjusting the commits selected for bitmaps by trying for this data shape. One way to attempt this is to not focus on refs but on history common to "most refs" using first- parent history as a good heuristic. If a commit appears in a lot of first-parent histories, then it is more likely to be helpful to these kinds of walks. > Note that when using the old bitmap traversal algorithm, the results can > be *slower* than without bitmaps! Under the new algorithm, the result is > computed faster with bitmaps than without (at the cost of over-counting > the true number of objects in a similar fashion as the non-bitmap > traversal): > > # (Computing objects unique to 'master' among all branches, not > # using bitmaps). > $ time git rev-list --count --objects master --not --exclude=master --branches > 54 I like how you included the output of --count here. It might be interesting to demonstrate the different forms of overcounting in a carefully constructed test case, which would help us be sure that we are using the desired algorithm during a test case. > real 0m1.012s (no bitmaps) > real 0m29.571s (bitmaps, old) > real 0m2.279s (bitmaps, new) > The new algorithm is still slower than not using bitmaps at all, but it > is nearly a 13-fold improvement over the existing traversal. > In a more realistic setting (using my local copy of git.git), I can > observe a similar speed-up: ... > Here, the new algorithm is also still slower than not using bitmaps, but > represents a 4-fold improvement over the existing traversal in a more > modest example. Interesting that the new bitmap algorithm is worse in all presented cases. How "atypical" do we need to be in order for bitmaps to outperform the tree-parsing algorithm? Do we need to try "--branches --not master~1000" to replicate a stale single-branch clone getting every latest commit? (I'll leave code review to a separate reply.) Thanks, -Stolee