On Fri, 20 Oct 2006, Aaron Bentley wrote:
>
> Agreed. We start by comparing BASE and OTHER, so all those comparisons
> are in-memory operations that don't hit disk. Only for files where BASE
> and OTHER differ do we even examine the THIS version.

Git just slurps in all three trees. I actually think that the current
merge-recursive.c does it the stupid way (ie it expands all trees
recursively, regardless of whether it's needed or not), but I should
really check with Dscho, since I had nothing to do with that code.

I wrote a tree-level merger that avoided doing the recursive tree reading
when the tree SHA-1s matched entirely, and re-doing the latest merge
using that took all of 0.037s, because it didn't recursively expand any
of the uninteresting trees.

But the default recursive merge was ported from the python script that
did it a full tree at a time, so it's comparatively "slow". It's fast
enough (witness the under-1s time ;) that I think the motivation to be
smarter about reading the trees was basically just not there, so my
"git-merge-tree" thing is languishing as a proof-of-concept.

So right now, git merging itself doesn't even take advantage of the "you
can compare two whole directories in one go" trick. We do that all over
the place in other situations, though (it's a big reason why doing a
"diff" between different revisions is so fast - you can cut the problem
space up and ignore the known-identical parts much faster).

That tree-based data structure turned out to be wonderful. Originally
(as in "first weeks of actual git work" in April 2005) git had a flat
"file manifest" kind of thing, and that really sucked. So the data
structures are important, and I think we got those right fairly early on.

> We can do a do-nothing kernel merge in < 20 seconds, and that's
> comparing every single file in the tree. In Python. I was aiming for
> less than 10 seconds, but didn't quite hit it.

Well, I know I can do that particular merge in 0.037 seconds (and that's
not counting the history traversal to actually find the common parent,
which is another 0.01s or more ;), so we should be able to comfortably
do the simple merges in less than a tenth of a second. But at some
point, nobody really cares any more.

Of course, this kind of thing depends a lot on developer behaviour. We
had some performance bugs that we didn't notice simply because the
kernel workload never triggered those patterns, while people using git
for other projects saw slower merges. Sometimes you don't see a problem
just because you end up looking at the wrong pattern for performance.

> > So recursive basically generates the matrix of similarity for the
> > new/deleted files, and tries to match them up, and there you have your
> > renames - without ever looking at the history of how you ended up where
> > you are.
>
> So in the simple case, you compare unmatched THIS, OTHER and BASE files
> to find the renames?

Right. Some cases are easy: if one of the branches only added files
(which is relatively common), those files obviously cannot be renames.
So you don't even have to compare all the possible combinations - you
know you don't have renames from one branch to the other ;)

But I'm not even the authoritative person to explain all the details of
the current recursive merge, and I might have missed something. Dscho?
Fredrik? Anything you want to add?

		Linus
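
For illustration, here is a toy sketch of that tree-id short-circuit, in
Python since that's what the original merge script was written in. This
is not git's merge-tree code: the object store, the entry layout and the
helper names are all invented here, and a real merge would fall back to
a content-level three-way merge instead of raising a conflict for every
clashing blob.

    import hashlib

    class Conflict(Exception):
        pass

    def write_tree(store, entries):
        """Hash a tree's entries and store the tree under that id."""
        body = "".join("%s %s %s\n" % (name, kind, oid)
                       for name, (kind, oid) in sorted(entries.items()))
        tid = hashlib.sha1(body.encode()).hexdigest()
        store[tid] = entries
        return tid

    def merge_tree(store, base_id, this_id, other_id, path=""):
        """Three-way merge of tree ids; returns the merged tree's id."""
        # The whole-tree short-circuits: two matching ids decide the
        # merge without expanding a single entry.  Because the ids are
        # content hashes, each comparison covers an entire subtree.
        if this_id == other_id:
            return this_id
        if base_id == this_id:
            return other_id
        if base_id == other_id:
            return this_id

        base, this, other = (store.get(i, {})
                             for i in (base_id, this_id, other_id))
        merged = {}
        for name in sorted(set(base) | set(this) | set(other)):
            b, t, o = base.get(name), this.get(name), other.get(name)
            if t == o:        # both sides agree (or both deleted it)
                pick = t
            elif b == t:      # only OTHER touched this entry
                pick = o
            elif b == o:      # only THIS touched this entry
                pick = t
            elif b and t and o and b[0] == t[0] == o[0] == "tree":
                # All three are subtrees: recurse.  Matching subtree
                # ids get short-circuited again one level down.
                pick = ("tree", merge_tree(store, b[1], t[1], o[1],
                                           path + name + "/"))
            else:
                # A real merge would three-way merge the contents here.
                raise Conflict(path + name)
            if pick is not None:
                merged[name] = pick
        return write_tree(store, merged)

The point of the sketch is only that merging two huge trees that differ
in a handful of paths reads just the trees along those paths, which is
why numbers like the 0.037s above are possible at all.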
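
The "matrix of similarity" rename matching can be sketched the same way.
Again the names are made up and this is not what git actually runs - as
far as I know, diffcore-rename scores similarity from delta sizes rather
than difflib, and puts a limit on how many candidate pairs it will
consider - but the shape of the idea is this:

    from difflib import SequenceMatcher

    def detect_renames(deleted, added, threshold=0.5):
        """Map old paths to new ones.  'deleted' and 'added' map
        path -> file content for the unmatched files on each side."""
        # Build the matrix: score every (deleted, added) pair.  If one
        # branch deleted nothing, 'deleted' is empty and the matrix
        # vanishes - the easy no-renames-on-that-side case above.
        candidates = []
        for old, old_text in deleted.items():
            for new, new_text in added.items():
                score = SequenceMatcher(None, old_text, new_text).ratio()
                if score >= threshold:
                    candidates.append((score, old, new))
        # Greedily pair off the best matches, each path used once.
        renames, used_old, used_new = {}, set(), set()
        for score, old, new in sorted(candidates, reverse=True):
            if old not in used_old and new not in used_new:
                renames[old] = new
                used_old.add(old)
                used_new.add(new)
        return renames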