Re: [PATCH v3] checkout: optimize "git checkout -b <new_branch>"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Aug 20, 2018 at 6:40 AM Ben Peart <peartben@xxxxxxxxx> wrote:
> On 8/18/2018 9:44 PM, Elijah Newren wrote:
...
> > == My opinions ==
> > - The performance wins are big enough that I can see why Ben is pushing this.
> > - I totally see Duy's point that more general optimizations would be
> > really nice.
> > - I really dislike the "giant list of checks" and the ongoing
> > maintenance burden it implies.
> >
> > Overall, I have to side with Duy and say I don't think we should take
> > the patch as-is.  Since that'll be frustrating for Ben to hear, I've
> > tried to come up with some additional alternatives:
>
> Thank you for the thorough review and summary.  From my perspective, it
> is both fair and accurate.
>
...
> > - Rewrite this patch so it instead does a very small set of checks at
> > the beginning of cmd_checkout(); e.g. check if argc == 3 and argv[1]
> > == "-b" and if so then perform the minimum operations needed to create
> > and checkout the new branch (maybe even calling in to cmd_branch() and
> > cmd_symbolic_ref() and calling some reflog update function).  Preface
> > it with a comment that it's a performance hack that might eventually
> > be able to go away.
>
> I'm happy to do this if it would make the patch better/more acceptable
> to the community.

For whatever it's worth, it'd change my vote.  While I would like
general optimizations more, I do think that's overall a longer term
prospect, and the wins here are big enough to justify a performance
hack -- if the amount of new code and added maintenance overhead is
minimized.  I think that checking for the args matching checkout -b
rather than checking all flags that might imply something other than
args being checkout -b would be a good way to help minimize both.

> One other change I could do would be to address Duy's concern about the
> config option would be to remove that entirely.  That means those using
> sparse-checkout don't have the option of getting the performance win but
> the benefit is that there isn't any documentation or behavior changes
> that need to be made - things just get a lot faster if you happen to use
> the (rather common) "checkout -b" option.

Playing with sparse-checkout, it feels to me like a half-baked
feature.  It seems like it required too much manual work, and it was
sometimes hard to tell if I was misunderstanding configuration rules
or I was just running into bugs in the code.  I think I hit both but I
didn't really want to get side-tracked further, yet.  (I do want to
eventually come back to it.)  The only reason someone would go through
that pain is if it provided massive performance benefits.

Viewing things when-in-doubt through the lens of strict backwards
compatibility may inadvertently mean enforcing buggy and/or slow
behavior with sparse-checkouts.  So, while git usually focuses
strongly on backwards compatibility (which is a good thing), that
might actually be wrong in the specific case of the sparse-checkout
feature.  In particular, since the whole point of the feature is
essentially a performance hack, if the initial implementation has ways
in which it really hurts performance, it seems like it'd be more in
keeping with the point of the feature to take performance fixes and
swallow subtle behavioral changes than to require special flags to get
decent performance.

That might sound like a long winded way of saying I'm totally in favor
of dropping the config option in this case, but I'm also trying to
build a case that I think there will be other situations where we want
to subtly and maybe even not-so-subtly change behavior of
sparse-checkouts to make them both more performant and far nicer for
end users.  I'm not sure others will agree with me on this argument,
but sparse checkouts felt painful to me and I think we need something
better.

> If making these two changes would resolve the remaining concerns, I'm
> happy to make them and send another patch series.  If it will just
> result in another round of "I'm still not comfortable with this" then
> I'd rather spend my time elsewhere. :)

That's fair.  The first change would be enough to resolve the concern
for me if others strongly push back on the second
(config-option-removal) change, but I'd definitely prefer both changes
-- fewer options means less code, less ongoing maintenance work, and
less manual setup needed by users.

> > - Possibly crazy idea for massive global performance win: modify how
> > the cache works so that it can have tree entries instead of just file
> > entries; make use of that with sparse checkouts (and partial clones?)
> > so that the cache only stores a tree when that entire tree is
> > "unwanted".  Suddenly, git status, git add, git checkout, etc., etc.
> > are all much faster.  merge needs some special work to operate,
> > though.
>
> Not crazy at all!  It's something we've investigated and discussed
> several times over the past couple of years.  This is the largest
> potential optimization that I am aware of as it has the potential to
> dramatically speed up most git operations as they become O(what I care
> about) vs O(size of repo/tree) - especially when paired with partial clone.
>
> The downside is pretty obvious, pulling this off means rewriting an
> awful lot of the code!  The index is central to so much of git and the
> assumption is everywhere that the index is simply a (global) array of
> cache entries to be looped through and manipulated at will.
>
> There would be a lot of refactoring work, creating a new index
> abstraction API, updating the code to utilize it, then updating the
> internal representation of the index to optimize it around the concept
> of being sparse/partial.  Likely, we'd want/need a new on disk format
> that would support the new structures in an optimal way.  Finally, the
> various algorithms (like unpack_trees()) could be rewritten to take
> advantage of the sparseness.
>
> Not something that can be taken on lightly - but certainly something
> that has huge potential to enable git to scale to much larger repos.

Well, the amount of work involved might qualify it as a "crazy" idea.
:-)  But, I'm glad others think it sounds like a useful path to purse.
Not only would it be a tremendous help performance-wise, it might also
be beneficial for another actually-crazy feature idea of mine.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux