On Wed, Jul 25, 2018 at 10:56 PM Ben Peart <peartben@xxxxxxxxx> wrote: > I'm still very new to this part of the code so am trying to figure out > what you're suggesting. I've read your description a few times and what > I'm getting out of it is that with some additional checks (ie verify > it's a twoway_merge, df_conflict_entry, not CE_CONFLICTED) that we > should be able to skip the whole tree similar to how Peff demonstrated > below without having to invalidate the cache tree to reflect modified > on-disk files. Is that correct or am I missing something? And I didn't give you an easy time because I was not very clear in my suggestion, I think. So let's start again. But first let's start with a potentially more generic optimization using cache-tree that I noticed just now. You now know traverse_trees() is used to walk N trees and the index at the same time. Cache tree is also used to quickly check if a big chunk of the index matches some tree object. So what if we try to avoid tree objects if possible (which reduces I/O, object inflation and tree parsing cost)? Let's say we're walking two trees X and Y, then we notice through cache-tree that X is the same in the index. Then instead of walking the actual X, you could just get the same entry from the index and make it "X". This way you only need to walk Y and the index (until the shared tree ends of course). If Y happens to match cache-tree too, all the better! Let's get back to two-way merge. I suggest you read the two-way merge in git-read-tree.txt. That table could give you a pretty good idea what's going on. twoway_merge() will be given a tuple of three entries (I, H, M) of the same path name, for every path. I think what we need is determine the condition where the outcome is known in advance, so that we can just skip walking the index for one directory. One of the checks we could do quickly is I==M or I==H (using cache-tree) and H==M (using tree hash). The first obvious cases that we can optimize are clean (H==M) ------ 14 yes exists exists keep index 15 no exists exists keep index In other words if we know H==M, there's no much we need to do since we're keeping the index the same. But you don't really know how many entries are in this directory where H==M. You would need cache-tree for that, so in reality it's I==H==M. The "clean" column is what fsmonitor comes in, though I'm not sure if it's actually needed. I haven't checked how '-u' flag works. There's two other cases that we can also optimize, though I think it's less likely to happen: clean I==H I==M (H!=M) ------------------ 18 yes no yes exists exists keep index 19 no no yes exists exists keep index Some other cases where I==H can benefit from the generic tree walk optimization above since we can skip parsing H. -- Duy