Re: [PATCH v3] repack: enable bitmaps by default on bare repos

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/23/2019 7:30 AM, Jeff King wrote:
> On Wed, Apr 10, 2019 at 06:57:21PM -0400, Jeff King wrote:
> 
>>   2. The answer we get from a bitmap versus a regular traversal are not
>>      apples-to-apples equivalent. The regular traversal walks down to
>>      the UNINTERESTING commits, marks the boundaries trees and blobs as
>>      UNINTERESTING, and then adds in all the interesting trees and blobs
>>      minus the UNINTERESTING parts. So it can sometimes give the wrong
>>      answer, claiming something is interesting when it is not.
>>
>>      Whereas with bitmaps we fill in the trees and blobs as we walk, and
>>      you get the true answer. But it means we may open up a lot more
>>      trees than the equivalent traversal would.
>>
>>      So one thing I've never really experimented with (and indeed, never
>>      really thought about until writing this email) is that the bitmaps
>>      could try to do that looser style of traversal, knowing that we
>>      might err on the side of calling things interesting in a few cases.
>>      But hopefully spending a lot less time opening trees.
>>
>>      I'm not even 100% sure what that would look like in code, but just
>>      thinking about it from a high level, I don't there's a particular
>>      mathematical reason it couldn't work.
> 
> I spent a while thinking and experimenting with this tonight. The result
> is the patch below. Ævar, do you still have a copy of the repo that
> misbehaved? I'd be curious to see how it fares.

This patch caught my attention, and I think I understand some of the issues
at hand. I'm not well-versed specifically in Git's bitmap implementation.
The fundamental ideas are there, but my perspective is biased by my own
independent bitmap implementation for Azure Repos. What worked there may not
work at all here.

> Finding the right trees to explore is a little tricky with bitmaps.  In
> a normal traversal, we consider the "edges" to be worth exploring.
> I.e., the places where an UNINTERESTING commit is the parent of an
> interesting one.

This is the "commit frontier" which I bring up again below.

> But with bitmaps, we don't have that information in the same way,
> because we're trying to avoid walking the commits in the first place. So
> e.g., in a history like this:
> 
>   A--B--C
>       \
>        D
> 
> Let's imagine we're computing "C..D", and that D has a bitmap on disk
> but C does not.

(As I read your discussion below, I'm confused. For "C..D", C is a have
and D is a want. We should explore all the haves _first_, then walk the
wants, right?)

> When we visit D, we'll see that it has a bitmap, fill in
> the results in our cumulative "want" bitmap, and then stop walking its
> parents (because we know everything they could tell us already).

I may be misreading something, but we would construct _a_ bitmap for C
by walking its reachable objects until hitting a bitmap we know about.
Perhaps A or B have a bitmap to start the construction. If we never
construct a bitmap for C, then we don't know what to remove from the "D"
bitmap to show the difference.

If "C" is not even in the bitmap pack, then we don't use bitmaps for
this calculation because of this condition:

        /*
         * if we have a HAVES list, but none of those haves is contained
         * in the packfile that has a bitmap, we don't have anything to
         * optimize here
         */
        if (haves && !in_bitmapped_pack(bitmap_git, haves))
                goto cleanup;

> Then we visit C. It's not bitmapped, so we keep walking. Unlike a normal
> traversal, we can't stop walking even though there are only
> UNINTERESTING commits left, because we have to fill the complete bitmap,
> including A and B (and you can imagine there might be thousands of
> ancestors of A, too). The reason is that we skipped adding ancestors of
> D when we saw its bitmap, so no still_interesting() cutoff will work
> there.
> 
> But how do we know that it's worth looking into the tree of "B"? If we
> assume we visit commits before their ancestors (because of the commit
> timestamp ordering), then we'd see that "B" is in the "want" bitmap
> already, which makes it worth visiting its tree.
> 
> Unfortunately we'd then continue on to "A", and it would _also_ look
> interesting, because it's also in the "want" bitmap. We don't have an
> easy way of knowing that "A" is basically already covered by "B", and is
> therefore not worth pursuing. In this example, we could pass down a bit
> from B to A as we traverse. But you can imagine another interesting
> commit branched from A, which _would_ make A's tree worth considering.
> 
> Fundamentally, as soon as we load a bitmap and stop traversing, we lose
> all information about the graph structure.
> 
> So the patch below just looks at every tree that might be worth
> exploring (so both "A" and "B" here, but not "C"). That shouldn't be any
> _worse_ than what the current code does (because it looks at all the
> trees). It appears to behave correctly, at least so far as passing the
> test suite. Running p5310 and p5311 against git.git and linux.git, it
> seems to perform worse. I'm not quite sure why that is (i.e., if I
> screwed up something in the implementation, or if the algorithm is
> fundamentally worse).

I'm of the opinion that the old method was better. It followed a very clear
three-step process:

 1. Compute the "haves" bitmap.

 2. Compute the "wants" bitmap, but don't explore into the "haves" bitmap.

 3. Subtract the "haves" bitmap from the "wants" (in case we added bits to
    the wants before they were in the haves).

But there is hope in your idea to improve some cases. As noted, we give up
if all of the haves are not in the bitmapped pack. By starting with a
commit walk, we can find the "commit frontier," which is the set of commits
that are explored by paint_down_to_common(). In this case, the set {B, C, D}.

After walking commits and finding a set of UNINTERESTING commits, we could
look for any UNINTERESTING commits that have bitmaps (or simply are in the
bitmapped pack) and use those to see our "haves" bitmap. So, if B has a
bitmap, but C is too new for the bitmap, then we can still create the haves
based on B and then take a bitmap diff "D minus B".

In fact, using the "merge base set" for the diff reflects what the non-bitmap
algorithm does in this case. It ignores C and only explores B.

I learned a lot looking at both versions of this method side-by-side. Thanks!

-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux