RE: [PATCH v3] checkout: eliminate unnecessary merge for trivial checkout

"Ben Peart" <peartben@xxxxxxxxx> · Mon, 19 Sep 2016 09:18:08 -0400

Let me see if I can better explain what I?m trying to accomplish with this
patch.  

"git checkout -b foo" (without -f -m or <start_point>) is defined in the
manual as being a shortcut for/equivalent to:

        (1a) "git branch foo"
        (1b) "git checkout foo"

However, it has been our experience in our observed use cases and all the
existing git tests, that it can be treated as equivalent to:

        (2a) "git branch foo"
        (2b) "git symbolic-ref HEAD refs/heads/foo"

That is, the common perception (use case) is to just create a new branch
"foo" (pointing at the current commit) and point HEAD at it WITHOUT making
any changes to the index or worktree.

However, the (1b) command has "git reset" connotations in that it should
examine and manipulate the trees, index, and worktree in the expectation
that there MIGHT be work to do.

Since this additional work in (1b) takes minutes on large repos and (2b)
takes less than a second, my intent was to identify the conditions that this
additional work will have no affect and thereby avoid it.

Alternatively, was the "-b" option just created as a shortcut only to avoid
calling the separate "git branch foo" command and we should not think about
the common perception and usage?

More comments inline...

> -----Original Message-----
> From: Junio C Hamano [mailto:gitster@xxxxxxxxx]
> Sent: Tuesday, September 13, 2016 6:35 PM
> To: Ben Peart <mailto:peartben@xxxxxxxxx>
> Cc: mailto:git@xxxxxxxxxxxxxxx; mailto:pclouds@xxxxxxxxx; Ben Peart
> <mailto:Ben.Peart@xxxxxxxxxxxxx>
> Subject: Re: [PATCH v3] checkout: eliminate unnecessary merge for trivial
> checkout
> 
> Ben Peart <mailto:peartben@xxxxxxxxx> writes:
> 
> > +static int needs_working_tree_merge(const struct checkout_opts *opts,
> > +      const struct branch_info *old,
> > +      const struct branch_info *new)
> > +{
> > +...
> > +}
> 
> I do not think I need to repeat the same remarks on the conditions in this
> helper, which hasn't changed since v2.  Many "comments" in the code do not
> explain why skipping is justified, or what they claim to check looks to me
just
> plain wrong.
> 
> For example, there is
> 
>        /*
>         * If we're not creating a new branch, by definition we're changing
>         * the existing one so need to do the merge
>         */
>        if (!opts->new_branch)
>                return 1;
> 
> but "git checkout" (no other argument) hits this condition.  It disables
the
> most trivial optimization opportunity, because we are not "creating".
> 

Disabling the optimization for "git checkout" with no argument was
intentional. This command does not create a new branch but instead, performs
a "soft reset" which will update the index and working directory to reflect
changes to the sparse-checkout (for example).  If this was not disabled,
many tests fail as they expect this behavior.  Because "git checkout" does
not actually change the refs, if we skipped the merge/index/working
directory update, this command becomes a no-op.

> "By definition, we're changing"?  Really?  Not quite.
> 

What I was attempting to communicate is that if we aren't creating a new
branch any changes or updates will happen in the existing branch.  Since
that could only be updating the index and working directory, we don't want
to skip those steps or we've defeated any purpose in running the command. 

> If you disable this bogus check, "git checkout" (no other argument) would
be
> allowed to skip the merge_working_tree(), and that in turn reveals another
> case that the helper is not checking when
> unpack_trees() MUST be called.
> 
>     Note: namely, when sparse checkout is in effect, switching from
>     HEAD to HEAD can nuke existing working tree files outside the
>     sparse pattern -- YUCK!  See penultimate test in t1011 for
>     an example.
> 
> This yuckiness is not your fault, but needs_working_tree_merge() logic you
> added needs to refrain from skipping unpack_trees() call when sparse thing
> is in effect.  I'd expect "git checkout -b foo"
> instead of "git checkout" (no other argument) would fail to honor the
sparse
> thing and reveal this bug, because the above bogus "!opts->new_branch"
> check will not protect you for that case.
> 

It is correct that this optimization will skip updating the tree to honor
any changes to the sparse-checkout in the case of creating a new branch.
Unfortunately, I don't know of any way to detect the changes other than
actually doing all the work to update the skip work tree bit in the index.
If this behavior is required, then this optimization will need to check  if
sparse-checkout is enabled and skip the optimization just in case there have
been changes.

> In other words, these random series of "if (...) return 1" are bugs hiding
> other real bugs and we need to reason about which ones are bugs that are
> hiding what other bugs that are not covered by this function.  As Peff
said
> earlier for v1, this is still an unreadable mess.  We need to figure out a
way to
> make sure we are skipping on the right condition and not accidentally
hiding
> a bug of failing to check the right condition.  I offhand do not have a
good
> suggestion on this; sorry.
> 

Beyond this code review process and testing, I don't know how else we make
sure we're caught all the conditions where we are OK skipping some of the
steps.  Any change has inherent risk - a change in behavior even more so.

> >  static int merge_working_tree(const struct checkout_opts *opts,
> >                                                struct branch_info *old,
> >                                                struct branch_info *new,
> >                                                int *writeout_error)
> >  {
> > +      /*
> > +      * Optimize the performance of "git checkout -b foo" by avoiding
> > +      * the expensive merge, index and working directory updates if
they
> > +      * are not needed.
> > +      */
> > +      if (!needs_working_tree_merge(opts, old, new))
> > +                      return 0;
> > +
> >          int ret;
> >          struct lock_file *lock_file = xcalloc(1, sizeof(struct
lock_file));
> 
> With the change you made at the beginning of this function, it no longer
> compiles with -Wdecl-after-stmt, but that is the smallest of the problems.

I apologize, I didn't realize this was a requirement.  It built and passed
all existing tests on Windows but I will reorder the declarations to prevent
causing issues with other platforms/compilers.

> 
> It is a small step in the right direction to move the call to the helper
from the
> caller to this function, but it is a bit too small.
> 
> Notice that the lines after the above context look like this:
> 
>             hold_locked_index(lock_file, 1);
>             if (read_cache_preload(NULL) < 0)
>                             return error(_("index file corrupt"));
> 
>             resolve_undo_clear();
>             if (opts->force) {
>                             ret = reset_tree(new->commit->tree, opts, 1,
> writeout_error);
>                             if (ret)
>                                             return ret;
>             } else {
>                             struct tree_desc trees[2];
>                             ...
> 
> I would have expected that the check goes inside the "else" thing that
> actually does a two-tree merge, and the helper loses the check with opts-
> >force, at least.  That would still be a change smaller than desired, but
at
> least a meaningful improvement compared to the previous one.  

I'll restructure it that way.

> As I have
> already pointed out, in the "else" clause there is a check "is the index
free of
> conflicted entries? if so error out", and that must be honored in
!opt->force
> case, no matter what your needs_working_tree_merge() says.  

Given we're not merging trees, updating the index, or work tree, why do we
need to error out in this case?  We aren't attempting this optimization if
they pass "-m."  If there are conflicted entries that haven't been fixed,
they will still exist.  We're essentially just creating a new reference for
the existing commit/index/work tree.

> I also was
> hoping that you would notice, when you were told about the unmerged
> check, by reading the remainder of the merge_working_tree(), that we need
> to call show_local_changes() when we are not doing force and when we are
> not quiet---returning early like the above patch will never be able to
call that
> one downstream in the function.

It is a good point that my optimization skipped the call to
show_local_changes.  Thanks for catching that, I've fixed it for my next
iteration.

> 
> Regardless of what the actual checks end up to be, the right place to do
this
> "optimization" would look more like:
> 
>  builtin/checkout.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/checkout.c b/builtin/checkout.c index
2b50a49..a6b9e17
> 100644
> --- a/builtin/checkout.c
> +++ b/builtin/checkout.c
> @@ -508,14 +508,19 @@ static int merge_working_tree(const struct
> checkout_opts *opts,
>                                             topts.dir->flags |=
DIR_SHOW_IGNORED;
>                                            
setup_standard_excludes(topts.dir);
>                             }
> +
> +                          if ( we know we can skip the unpack ) {
> +                                          ret = 0;
> +                          } else {
>                                             tree =
parse_tree_indirect(old->commit ?
> 

                  old->commit-
> >object.oid.hash :
> 

                  EMPTY_TREE_SHA1_BIN);
>                                             init_tree_desc(&trees[0],
tree->buffer, tree->size);
>                                             tree =
parse_tree_indirect(new->commit-
> >object.oid.hash);
>                                             init_tree_desc(&trees[1],
tree->buffer, tree->size);
> -
>                                             ret = unpack_trees(2, trees,
&topts);
> +                          }
> +
>                             if (ret == -1) {
>                                             /*
>                                              * Unpack couldn't do a
trivial merge; either
> 

I'll restructure it to be like you suggest above however, given we will not
be merging the tress, we won't have any index changes to write out. I will
also skip the calls to cache_tree_update and write_locked_index.

> I'd think.  Note that the determination of "we can skip" would involve
> knowing the object names of the two trees involved, so for performance
> reasons, some of the parse-tree calls may have to come before the call to
> "do we know we can skip?", but that does not fundamentally change the
> basic code structure.
> 
> Thanks.

I don't understand why we'd need to know the object names of the two trees
given we have the IDs.  What did you have in mind that would need those?