Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 8/18/2018 9:44 PM, Elijah Newren wrote:
On Fri, Aug 17, 2018 at 5:41 AM Ben Peart <peartben@xxxxxxxxx> wrote:
On 8/16/2018 2:37 PM, Duy Nguyen wrote:
On Thu, Aug 16, 2018 at 8:27 PM Ben Peart <Ben.Peart@xxxxxxxxxxxxx> wrote:

From: Ben Peart <Ben.Peart@xxxxxxxxxxxxx>

Skip merging the commit, updating the index and working directory if and
only if we are creating a new branch via "git checkout -b <new_branch>."
Any other checkout options will still go through the former code path.

If sparse_checkout is on, require the user to manually opt in to this
optimzed behavior by setting the config setting checkout.optimizeNewBranch
to true as we will no longer update the skip-worktree bit in the index, nor
add/remove files in the working directory to reflect the current sparse
checkout settings.

For comparison, running "git checkout -b <new_branch>" on a large repo takes:

14.6 seconds - without this patch
0.3 seconds - with this patch

I still don't think we should do this. If you want lightning fast
branch creation, just use 'git branch'. From the timing breakdown you
shown in the other thread it looks like sparse checkout still takes
seconds, which could be optimized (or even excluded, I mentioned this
too). And split index (or something similar if you can't use it) would
give you saving across the board. There is still one idea Elijah gave
me that should further lower traverse_trees()  cost.


We have investigated some of these already - split index ended up
slowing things down more than it sped them up do to the higher compute
costs.  Sparse checkout we've already optimized significantly - limiting
the patterns we accept so that we can do the lookup via a hashmap
instead of the robust pattern matching.  We will continue to look for
other optimizations and appreciate any and all ideas!

In the end, this optimization makes a huge performance improvement by
avoiding doing a lot of work that isn't necessary.  Taking a command
from 14+ seconds to sub-second is just too much of a win for us to ignore.

But anyway, it's not my call. I'll stop here.

It's even less of my call, but since things seem to be stuck in
what-should-we-do state (as per Junio's comments on this patch in the
last two "What's cooking" emails), and since Ben and Duy obviously
have opposite opinions on Ben's patch, let's see if I might be able to
help at all.  Here's my summary and my findings:

== The pain ==
- For repositories with a really large number of entries (500K as Ben
says), some operations are much slower than it feels like they should
be.
- This does not seem to be GFVS-specific in any way, I can duplicate
slowness with a simple git-bomb[1]-like repo that has a sparse
checkout pattern ignoring the "bomb" side.  (It has 1M+1 entries in
the index, and .git/info/sparse-checkout ignores the 1M so the working
copy only has 1 entry).  The timings on my repo for "git checkout -b
$NEWBRANCH" are almost exactly double what Ben reports he gets on
their repo.

[1] https://kate.io/blog/making-your-own-exploding-git-repos/

== Short term solutions ==
- Alternative git commands exist today to do a fast checkout of a new
branch in a huge repo.  I also get sub-second timings in my
even-bigger repo with this:
    git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH
But I do understand that wrapping this into a script or executable
(git-fast-new-branch?) and asking users to use it is a usability
problem and an uphill battle.  (Sidenote: this isn't quite the same
operation; it's missing a reflog update.  The -m option to
symbolic-ref doesn't seem to help; I guess the fact that HEAD's
sha1sum is not changing is viewed as not an update?  However, that
could also be scripted around.)
- Ben's patch successfully drops the time for "git checkout -b
$NEWBRANCH" from 26+ seconds (in my cooked-up testcase) to sub-second
(in fact, under .1 seconds for me).  That's a _huge_ win.

== unpack_trees optimization notes ==
- Ben's patch is extremely focused.  It only affects "git checkout -b
$NEWBRANCH".  If someone runs "git branch $NEWBRANCH && git checkout
$NEWBRANCH", they get the old 26+ second timing.  They also suddenly
get the really long timing if they add any other flags or checkout a
commit that differs in only a single entry in the entire tree.  It
would be nice if we did general optimization for all issues rather
than just special casing such narrow cases.
- However, optimizing unpack_trees is hard.  It's really easy to get
lost trying to look at the code.  Time has been spent trying to
optimizing it.  Ben really likes the speedup factors of 2-3 that Duy
has produced.  But he's pessimistic we'll find enough to bridge the
gap for this case.  And he's worried about breaking unrelated stuff
due to the complexity of unpack_trees.
- Duy is pretty sure we can optimize unpack_trees in at least one more
way.  I've tried looking through the code and think there are others,
but then again I'm known to get lost and confused in unpack_trees.

== The patch ==
- Ben's patch only affects the "checkout -b $NEWBRANCH" case.  He
checks for it by looking for any other flag that would be a different
case, and using the old codepath if he finds any.
- This means there is a "giant list of checks" for this optimization,
and an ongoing maintenance burden because if anyone ever adds any
extra options, this optimization might suddenly break things if that
giant list of checks isn't updated.  Ben even added a comment to the
code hoping to help alert others who come along and add extra options:
+ /*
+ * If new checkout options are added, skip_merge_working_tree
+ * should be updated accordingly.
+ */

== Other notes ==
- In my cooked-up testcase, I also noticed that things like git add or
git status were slow in a repo with lots of index entries.  There may
be several other areas in need of performance boosts too.


== My opinions ==
- The performance wins are big enough that I can see why Ben is pushing this.
- I totally see Duy's point that more general optimizations would be
really nice.
- I really dislike the "giant list of checks" and the ongoing
maintenance burden it implies.

Overall, I have to side with Duy and say I don't think we should take
the patch as-is.  Since that'll be frustrating for Ben to hear, I've
tried to come up with some additional alternatives:

Thank you for the thorough review and summary. From my perspective, it is both fair and accurate.

This is a performance improvement we've been using for nearly 2 years [1] because the win is so significant in our case, it makes living with the maintenance burden worth while. That said, in the past 2 years, there haven't been any new options added to the checkout command that required any maintenance but the possibility clearly does exist.

My purpose in pushing this patch to the mailing list is to improve git in general and to let others benefit from the performance improvement we've been enjoying. If the community doesn't feel like the risk/reward is worth it, then it shouldn't be accepted and I'm perfectly fine with that.

[1] https://github.com/git-for-windows/git/commit/0a64b9ae6df02d4df3e41227c43c4217a1a90bfa


== Alternatives (both short and long term) ==
- Use the 'git branch $NEWBRANCH && git symbolic-ref HEAD $NEWBRANCH'
trick.  It's essentially just as fast (well, you exec git twice so
it's slightly slower, but it's close).  However, there's a difficult
get-it-to-the-users hurdle.
- Rewrite this patch so it instead does a very small set of checks at
the beginning of cmd_checkout(); e.g. check if argc == 3 and argv[1]
== "-b" and if so then perform the minimum operations needed to create
and checkout the new branch (maybe even calling in to cmd_branch() and
cmd_symbolic_ref() and calling some reflog update function).  Preface
it with a comment that it's a performance hack that might eventually
be able to go away.

I'm happy to do this if it would make the patch better/more acceptable to the community.

One other change I could do would be to address Duy's concern about the config option would be to remove that entirely. That means those using sparse-checkout don't have the option of getting the performance win but the benefit is that there isn't any documentation or behavior changes that need to be made - things just get a lot faster if you happen to use the (rather common) "checkout -b" option.

If making these two changes would resolve the remaining concerns, I'm happy to make them and send another patch series. If it will just result in another round of "I'm still not comfortable with this" then I'd rather spend my time elsewhere. :)

- Look into the performance optimization(s) Duy mentioned.
- Other performance optimizations in unpack_trees(); it seems we have
an awful lot of loops over all cache entries, and it sure seems to me
like some of them could somehow be focused on just the things that are
changing instead of again checking everything.  Or maybe I'm wrong and
they're all needed.

I am absolutely in favor of any performance optimizations that can be made in unpack_trees(). I am not very familiar with that code and only recently started digging in when I started my recent RFC/patch series to multi-thread it [2]. I'm extremely grateful for all the ideas and work that thread has triggered - to me, it's an example of open source software development at it's finest.

I'd never even thought of using the cache-tree to simplify the traversal in that way and am grateful others have and especially to Duy who backed it up with a great patch series (I started the work but he was a lot faster at getting working patches than I was!). In my opinion, this was a much better solution than the complexity of multi-threading this complex piece of code (though that remains an option if it is still needed).

[2] https://public-inbox.org/git/20180718204458.20936-1-benpeart@xxxxxxxxxxxxx/

- Possibly crazy idea for massive global performance win: modify how
the cache works so that it can have tree entries instead of just file
entries; make use of that with sparse checkouts (and partial clones?)
so that the cache only stores a tree when that entire tree is
"unwanted".  Suddenly, git status, git add, git checkout, etc., etc.
are all much faster.  merge needs some special work to operate,
though.


Not crazy at all! It's something we've investigated and discussed several times over the past couple of years. This is the largest potential optimization that I am aware of as it has the potential to dramatically speed up most git operations as they become O(what I care about) vs O(size of repo/tree) - especially when paired with partial clone.

The downside is pretty obvious, pulling this off means rewriting an awful lot of the code! The index is central to so much of git and the assumption is everywhere that the index is simply a (global) array of cache entries to be looped through and manipulated at will.

There would be a lot of refactoring work, creating a new index abstraction API, updating the code to utilize it, then updating the internal representation of the index to optimize it around the concept of being sparse/partial. Likely, we'd want/need a new on disk format that would support the new structures in an optimal way. Finally, the various algorithms (like unpack_trees()) could be rewritten to take advantage of the sparseness.

Not something that can be taken on lightly - but certainly something that has huge potential to enable git to scale to much larger repos.


I'm potentially interested in eventually looking into and/or helping
with optimization work in this area.  I might even get more time to
work on git at $DAYJOB since sparse-checkouts/partial-clone/gvfs/etc.
all are looking more and more interesting as we have large repos that
are growing quickly (particularly as we consolidate many repos down to
fewer).  But they've said that such work won't take priority for
several months down the road for them, it's not yet clear how much
they'd sponsor vs. just want to see what existed and how we can use
it, and I'd kind of like to clear a few other things off my plate
first too (such as the merge-recursive rewrite).

Anyway, that's just my $0.02.  I hope something I said helps.

Elijah




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux