Re: [PATCH v1 0/3] [RFC] Speeding up checkout (and merge, rebase, etc)

Duy Nguyen <pclouds@xxxxxxxxx> · Thu, 26 Jul 2018 07:30:20 +0200

On Wed, Jul 25, 2018 at 10:56 PM Ben Peart <peartben@xxxxxxxxx> wrote:
> I'm still very new to this part of the code so am trying to figure out
> what you're suggesting.  I've read your description a few times and what
> I'm getting out of it is that with some additional checks (ie verify
> it's a twoway_merge, df_conflict_entry, not CE_CONFLICTED) that we
> should be able to skip the whole tree similar to how Peff demonstrated
> below without having to invalidate the cache tree to reflect modified
> on-disk files.  Is that correct or am I missing something?

I didn't make it easy for you, because my suggestion was not very
clear, I think. So let's start again. But first let's start with
a potentially more generic optimization using cache-tree that I
noticed just now.

You now know traverse_trees() is used to walk N trees and the index at
the same time. Cache tree is also used to quickly check if a big chunk
of the index matches some tree object. So what if we try to avoid tree
objects if possible (which reduces I/O, object inflation and tree
parsing cost)? Let's say we're walking two trees X and Y, then we
notice through cache-tree that X is the same in the index. Then
instead of walking the actual X, you could just get the same entry
from the index and make it "X". This way you only need to walk Y and
the index (until the shared tree ends of course). If Y happens to
match cache-tree too, all the better!
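
To make that concrete, here is a rough toy model in Python, not git's
actual C code. Trees are dicts keyed by a fake object id, the index is
a flat path-to-blob map, and cache_tree maps a directory prefix to the
tree id the index is known to match. Every name in it (load_tree,
entries_for, the ids) is made up for illustration, and subdirectories
are ignored to keep it short.

```python
# Toy model: all names and ids are hypothetical, not git's API.
OBJECTS = {
    "treeX": {"a.c": "blob1", "b.c": "blob2"},
    "treeY": {"a.c": "blob1", "b.c": "blob3"},
}
loads = []  # records every simulated tree-object read


def load_tree(oid):
    """Simulate reading, inflating and parsing a tree object."""
    loads.append(oid)
    return OBJECTS[oid]


def entries_for(oid, prefix, index, cache_tree):
    """Return {name: blob} for a tree, avoiding the object read when
    cache-tree says the index already matches this tree at `prefix`."""
    if cache_tree.get(prefix) == oid:
        # Rebuild the tree's entries from the index instead of the object.
        return {path[len(prefix):]: blob
                for path, blob in index.items()
                if path.startswith(prefix) and "/" not in path[len(prefix):]}
    return load_tree(oid)


index = {"src/a.c": "blob1", "src/b.c": "blob2"}
cache_tree = {"src/": "treeX"}  # the index under src/ matches treeX

x = entries_for("treeX", "src/", index, cache_tree)  # served from the index
y = entries_for("treeY", "src/", index, cache_tree)  # real tree read
```

In this model only treeY is ever loaded; the entries for X come
straight out of the index.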

Let's get back to two-way merge. I suggest you read the two-way merge
section in git-read-tree.txt. The table there could give you a pretty
good idea of what's going on. twoway_merge() will be given a tuple of
three entries (I, H, M) of the same path name, for every path. I think
what we need is to determine the conditions where the outcome is known
in advance, so that we can just skip walking the index for one
directory. One of the checks we could do quickly is I==M or I==H
(using cache-tree) and H==M (using tree hash).

The first obvious cases that we can optimize are

        clean (H==M)
       ------
     14 yes                 exists   exists   keep index
     15 no                  exists   exists   keep index

In other words, if we know H==M, there's not much we need to do since
we're keeping the index the same. But H==M alone doesn't tell you how
many index entries are in this directory. You would need cache-tree
for that, so in reality it's I==H==M.
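
A minimal sketch of that directory-level check, again in Python rather
than git's C. CacheTreeNode is a hypothetical stand-in for a cache-tree
node; I'm assuming it carries the tree id the index matches and the
number of index entries it covers, with a negative count meaning the
node has been invalidated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheTreeNode:       # hypothetical stand-in for a cache-tree node
    oid: Optional[str]     # tree id the index matches, if known
    entry_count: int       # index entries covered; negative = invalid


def entries_to_skip(node, h_oid, m_oid):
    """Return how many index entries can be skipped for this
    directory, or 0 if it still has to be walked."""
    if node.entry_count < 0:
        # Cache-tree is invalid here, so we know nothing about I.
        return 0
    if h_oid == m_oid and node.oid == h_oid:
        # I == H == M: the index is kept as is for the whole directory.
        return node.entry_count
    return 0
```

With a valid node covering 42 entries and all three ids equal, the
caller could jump 42 entries ahead in the index; in every other case
it falls back to the normal walk.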

The "clean" column is where fsmonitor comes in, though I'm not sure if
it's actually needed. I haven't checked how the '-u' flag works.

There are two other cases that we can also optimize, though I think
they're less likely to happen:

        clean I==H  I==M (H!=M)
       ------------------
     18 yes   no    yes     exists   exists   keep index
     19 no    no    yes     exists   exists   keep index
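
Taken together, rows 14/15 and 18/19 can be written as a small
per-path predicate. This is only a sketch: known_outcome is a made-up
name, i/h/m are blob ids for one path that exists on all three sides,
and the "clean" column is left out because it doesn't change the
result in these four rows.

```python
def known_outcome(i, h, m):
    """Return "keep index" when the result is known in advance
    (rows 14/15 and 18/19), else None."""
    if h == m:
        return "keep index"   # rows 14/15
    if i == m:
        return "keep index"   # rows 18/19 (H != M, so I != H as well)
    return None               # other rows need the full two-way merge
```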

Some other cases where I==H can benefit from the generic tree walk
optimization above since we can skip parsing H.
-- 
Duy