Re: [PATCH 1/1] commit-graph.c: avoid unnecessary tag dereference when merging

Taylor Blau <me@xxxxxxxxxxxx> · Sun, 22 Mar 2020 09:44:39 -0600

On Sun, Mar 22, 2020 at 01:49:16AM -0400, Jeff King wrote:
> On Sat, Mar 21, 2020 at 08:23:13PM -0400, Derrick Stolee wrote:
>
> > On 3/21/2020 8:20 PM, Taylor Blau wrote:
> > > On Sat, Mar 21, 2020 at 08:03:01PM -0400, Derrick Stolee wrote:
> > >> On 3/21/2020 2:50 PM, Junio C Hamano wrote:
> > >>> Do we need to worry about INFO_QUICK and SKIP_FETCH_OBJECT in this
> > >>> codepath, by the way?
> > >>
> > >> I was coming back to this thread to bring up these exact flags for
> > >> consideration. The good news is that in a partial clone with any
> > >> amount of filtering we will still have all reachable commits, which
> > >> are necessary for the commit-graph to make sense. The only ones that
> > >> would fail has_object_file() are ones removed by GC, but they may
> > >> still exist on the remote. So without SKIP_FETCH_OBJECT, we would
> > >> generate a network call even if the server has GC'd to remove the
> > >> commits. This gets particularly bad when the server returns all
> > >> reachable objects from that commit!
> > >
> > > That makes sense. Do you think something like this should be applied?
> > > +       int flags = OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT;
> > [...]
> >
> > Yes, I think this is the appropriate update. Thanks!
>
> I'm not so sure.
>
> In a non-partial clone, when would expect QUICK to matter? If we find
> the object is missing, then either:
>
>   - we are racing with repack, and we would benefit from double-checking
>     that the object really is gone
>
>   - we used to have it (since it was mentioned in the graph file) but it
>     went away
>
> Using QUICK means we won't waste time double-checking in the second
> case. But it means we won't catch the first case, and we may generate a
> new graph file that omits the object. They're both optimizations, and I
> don't think we're impacting correctness[1], but it's not clear to me
> whether one is a win over the other. We don't generally expect objects
> we have to go away often.
>
> Skipping fetching seems a little more straight-forward to me. If we had
> it in a graph file, presumably we had the actual object before, too. And
> either we're in the first case above (we really do have it and just need
> to double-check) in which case not saying QUICK would be enough. Or we
> intentionally got rid of it. In which case downloading it just to
> generate a cache is quite silly.

I was going to write that I'm not entirely sure of this, but I tried to
talk myself through it below, and I think that the right flag is *only*
OBJECT_INFO_SKIP_FETCH_OBJECT.

First, I agree with your reasoning that we shouldn't have
OBJECT_INFO_QUICK, that much makes sense to me.

To consider OBJECT_INFO_SKIP_FETCH_OBJECT, let's say that we used to
have some commit, and it got GC'd away. This shouldn't happen unless a
commit is unreachable, so if we ever find ourselves in a situation where
the parent of some other commit is missing, we know that that represents
a true corruption, not the result of a normal 'git gc'.

Now, if we do find ourselves in this case, we know that a 'git
commit-graph write --input=none --split' *would* work because we would
find that the unreachable commit (if it were in the graph) was missing,
and 'OBJECT_INFO_SKIP_FETCH_OBJECT' would tell us not to go looking for
it. Likewise, if the commit weren't already in the graph, this would
work fine, too.

But, most of the time we won't even get a chance to write a new
commit-graph based on existing ones, since 'gc.writeCommitGraph' will
take care of this for us in-process with 'git gc'. This uses
'--reachable' without '--split', so we won't ever invoke
'merge_commit_graph{s,}'.

> So it seems like SKIP_FETCH_OBJECT but _not_ QUICK might be reasonable
> here.

I agree with your original reasoning that OBJECT_INFO_QUICK is the wrong
choice, but I think my reasoning above that
OBJECT_INFO_SKIP_FETCH_OBJECT *is* an appropriate flag is sound.

> -Peff
>
> [1] I'm actually not quite sure about correctness here. It should be
>     fine to generate a graph file without any given commit; readers will
>     just have to load that commit the old-fashioned way. But at this
>     phase of "commit-graph write", I think we'll already have done the
>     close_reachable() check. What does it mean to throw away a commit at
>     this stage? If we're the parent of another commit, then it will have
>     trouble referring to us by a uint32_t. Will the actual writing phase
>     barf, or will we generate an invalid graph file?

Thanks,
Taylor