Re: [PATCH 3/2] Optimize the two-way merge of git-read-tree too

On Sat, 11 Aug 2007, Junio C Hamano wrote:
> 
> Then, the new 3-way read-tree (or unpack_trees) would go like
> this.
> 
> First, we scan the index and find locally modified paths;
> invalidate cache-tree by calling cache_tree_invalidate_path()
> for such paths.  The expectation is that you will still have
> largely usable cache-tree after this step.
> 
> The main part of the merge will be done by using tree_desc based
> traversal that walks over three (ancestor, mine and theirs)
> trees, instead of the current unpack_trees() that iterates
> mainly over the index entries.
> 
> (A) The three trees all may have a subtree at the same path, or
>     the one that does not have a tree where others have does not
>     have a blob at the path either (i.e. the subtree is missing
>     from the tree without a D/F conflict).  If the usual 3-way
>     tree level merge rule (one side changes while the other side
>     does not do anything, or both sides change to the same) can
>     be applied, and if the cache tree entry for that subtree is
>     valid and matches our tree, then...
> 
>     - we know the working tree for that entire subdirectory is
>       also clean because of the earlier invalidation;
> 
>     - we can simply follow the tree level merge without ever
>       looking at individual paths contained in the subtree.
> 
> (B) If the precondition of (A) does not hold, we recurse into
>     the subtree, and perform per-path 3-way merge, like the
>     current unpack_trees() does.

I'd love to see this, because it would likely speed up merging of really 
big trees by a huge amount, and because I think it's fundamentally the 
"right thing(tm)" to do.

That said, right now we do so well that it's almost not interesting. The 
kernel tree is so small as to merge in basically zero time even when you 
do per-file merges, and generating the diffstat is almost always the 
biggest component of the merge by far.

But yes, for bigger trees, and just because it's the right thing to do 
considering the data structures we have, we really should plan on doing 
that some day. I think that the current optimizations (at least with 
current hardware - old hw might change the equation), based on my 
timings on the "bummer" tree, are fine up to at _least_ 100,000 files. But if 
we're talking about slower hardware, or many more files than that, we 
should _definitely_ eventually do the tree-based optimizations.

> Another optimization I was hoping to do was "git diff --cached",
> which is unrelated to the recent change to use read_tree().

Yeah. "git diff" is just about the most performance-critical one. That 
said, the "--cached" case is likely the least used one, and any time you 
don't use that, you end up having to lstat() the files anyway, so you 
cannot do the tree-based optimizations ;^p

But it would help "git status", I think (which needs to do both --cached 
and the normal one).

		Linus

