Re: tree sorting..

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



[adding in the list]
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On irc:
> 
>   spearce: personally the utf-8 commit encoding mess is less of a mess 
> 	then the "lets sort trees with / after their name, but not store 
> 	them that way!".
> 
> umm, we *do* store them that way, so I'm a bit confused?

Sorry, hasty IRC messages don't always read very well.  What I was
trying to say was that the decision to sort entries in the tree
the way that they are sorted is an annoying problem.

Entries in a tree must be unique by their name; that is I cannot have
"foo" as both a file and as a subtree.  An excellent constraint
to have.  But the position of the entry "foo" is different for a
file than for a subtree, as the subtree "foo" is sorted as though
it were "foo/" while the file "foo" is sorted by just "foo".

So there's additional complexity imposed.  I cannot just binary
search a tree and locate "foo", as "foo" could appear at one of two
different positions in that tree.  Sure I might know that the user
is asking for "foo/" (as the input was "foo/bar"); but what if I
search for "foo/" and in this particular tree there is a "foo",
but its a file?  I cannot conclude that foo doesn't exist here
without searching *again* for "foo" on its own.  Likewise if the
user asks for "foo" (no trailing slash).


The entire reason this happened is hysterical raisins.  Your first
draft of Git had trees as one flat structure, sorting complete
paths by their memcmp result.  Which is dead simple and worked.

But folks pointed out it would be more scalable to store each
directory on its own, as not every directory changes on each commit.
So you split the trees up in the object database, but you left the
index as-is.

This meant that as you were walking the index you found the subtree
"foo" where "foo/" appears, not where "foo" appears.  In my opinion
it is a *BUG* that you did not correct the sorting to remove the
trailing slash when you split the index down into individual tree
objects.

Now we are stuck with it.

And that flat index?  It worked well to get us started, sure, but
the whole cache_tree thing that Junio wedged in was specifically to
address the scability of the index on larger repositories.  We don't
need to recreate (or examine) trees for parts that didn't change.

Wasn't that the original scability argument as to *why* you broke
out the tree objects in the ODB apart?  How was the index exempt
from that?

Basically you cut a corner way back by not also splitting the
index up by subtree when you split the ODB "tree" up by subtree,
and we've been paying for it ever since.


I'm personally annoyed about it because I've fought this problem
several times.  In jgit (my Java Git implementation) I can't do a
simple binary search on a tree's contents, because of the issues
stated above.  In git-fast-import I can't do a simple binary search,
because of the issues stated above.  There I also have to resort a
tree *twice* to generate the delta, because a path name might have
changed type and thus changed positions within the tree.

Last night we found a bug where you can't do a two-head merge with
`read-tree -m` when "A" is a subdirectory in this branch but is a
file in the branch you are switching to.  This is difficult to fix
because within the index we cannot simply do a type comparsion of
"A" and "A" to see if they are the same...  Junio is trying to
patch around it, but it ain't pretty.

With the new pack v4 concept of a 6 byte fixed-width tree record,
we could actually do binary search within a tree, rather than
linear search.  But again, its uh, more difficult than it probably
should be.


We all do things that in hindsight could have been done differently.
I *definately* would rather have the Git we have today, than to
not have one at all.  I'm extremely grateful that you and several
others took time away from Linux kernel hacking to get this project
started, and kept a practical focus throughout, ensuring we would
have a useful tool.

I just get frustrated with it sometimes.  ;-)

-- 
Shawn.
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]