Re: [RFC PATCH] Re: Empty directories...

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Sat, 21 Jul 2007, David Kastrup wrote:
>
>> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:
>> 
>> > Of course, it seldom matters, but basically, you should test a directory 
>> > structure that has the files
>> >
>> > 	dir.c
>> > 	dir/test
>> >
>> > in it, and the "dir" directory should always sort _after_ "dir.c".
>> >
>> > And yes, having the index entry with a '/' at the end would handle
>> > that automatically.
>> 
>> You completely lost me here.  I guess I'll be able to pick this up
>> only after investing considerable more time into the data structures.

[Basic explanation about git sort order and trees sorting as tree/ in
order to be in the right sort order for a prefix]

Ok, I could not have figured this out on my own.  Are there any design
documents or does one just have to pester the list?

> So the basic issue is that not only does git obviously think that only 
> content matters, but it describes it with a single SHA1. 
>
> That's not an issue at all for a single file, but if you want to describe 
> *multiple* files with a single SHA1 (which git obviously very much wants 
> to do), the way you generate the SHA1 matters a lot.
>
> In particular, the order.
>
> So git is very very strict about the ordering of tree structures. A tree 
> structure is not just a random list of
>
> 	<ASCII mode> + <space> + <filename> + <NUL> + <SHA1>

Ok.

> So git filenames are very much a "stream of bytes", not anything
> else. And they need to sort 100% reliably, always the same way, and
> never with any localized meaning.

There is some utf-8/Unicode trouble to be expected in connection with
that eventually: some, but not all operating and/or file systems
canonicalize file names, replacing accented letters by a combining
accent and the letter.  But that's beside the point.

> And, partly because it seemed most natural, and partly for
> historical reasons, the way git sorts filenames is by sorting by
> *pathname*. So if you have three files named
>
> 	a.c
> 	a/c
> 	abc
>
> then they sort in that exact order, and no other! They sort as a
> "memcmp" in the full pathname, and that's really nice when you see
> whole collections of files, and you know the list is globally
> sorted.

It is amusing that my description of git having no external concept of
directories except as an expedience for representing slashes in
filenames was much closer to the mark that I would have expected.

> So that "global pathname sorting" has nice properties, and it seems 
> "obvious", but it means that because git actually *encodes* those three 
> files hierarchically as two different trees (because there's a 
> subdirectory there), the tree objects themselves sort a bit oddly. The 
> tree obejcts themselves will look like
>
>  top-level tree:
> 	100644 a.c -> blob1
> 	040000 a   -> tree2
> 	100644 abc -> blob3
>
>  sub-tree:
> 	100644 c    -> blob2
>
> and notice how the *tree* is not sorted alphabetically at all. It has a 
> subtly different sort, where the entry "a" sorts *after* the entry "a.c", 
> because we know that it's a tree entry, and thus will (in the *global* 
> order) sort as if it had a "/" at the end!
>
> Traditionally, when we have the index, the index sorting has been very 
> simple: you just sort the names as memcmp() would sort them. But note how 
> that changes, if "a" is an empty directory. Now the index needs to sort as
>
> 	file a.c
> 	dir  a
> 	file abc
>
> because when we create the tree entry, it needs to be sorted the same way 
> all tree entries are always sorted - as if "a" had a slash at the end!

Here is the layout as I would scheme it:

tree1:
     0?0000 .   -> dir1
     100644 a.c -> blob1
     040000 a   -> tree2
     100644 abc -> blob3

sub-tree:
     0?0000 .    -> dir2
     100644 c    -> blob2

Remember that a tree evaporates when it is empty, and if we don't want
to mess with that (which appears like a good idea to me), the "don't
delete this" indication belongs in the subtree where its natural name
is ".".  Since the dir entries are _leaves_ in the tree, there is no
necessity for sorting them specially.  They will usually appear first,
but people to all sorts of things, so filenames starting with "!"
might still come before them.

So the sorted flat file list for the above would be
.    [dir]
a.c  [file]
a/   [tree]
a/.  [dir]
a/c  [file]
abc  [file]

Note that a tree is basically just a string arrangement tool which
gets only incidentally mapped to directories when checking out.

So I am quite unhappy that 040000 is already taken by it.  I can't
even say, "ok, let . look like an empty tree" because there should not
be something like an empty tree!  I find the correlation empty->gone
very important.

> [ Yeah, yeah, we could make a special case and just say "the empty
> tree sorts differently", but that actually results in huge problems
> when doing a "diff" between two trees: our diff machinery very much
> depends on the fact that the index and the trees always sort the
> same way, and if we sorted the "a" entry (when it is an empty
> directory) differently from the "a" entry (when it has entries in
> it), that would just be insane and cause no end of trouble for
> comparing two trees - one with an empty directory and one with
> content added to that directory.

It appears to me like our ideas are still out of sync: a directory
under my scheme is _not_ at all an empty tree, rather it is an entry
_inside_ of a tree, making the tree non-empty (which means that git
will not be tempted to delete the corresponing real-world directory
_until_ one deletes the directory entry keeping the tree alive).

>   So the sorting is doubly important: it's what makes "one content"
>   always have the same SHA1, but it is also much easier and
>   efficient to compare directories when we know they are sorted the
>   same way. ]
>
> It's *probably* just a few lines of code, and it actually would
> result in some nice changes ("git ls-files" would show a '/' at the
> end of an empty directory entry, for example), so this is not a big
> deal, but it's an example of how subtly different a directory is
> from a file when it comes to git.

Linus, a directory is simply non-existent inside of git.  Trees are an
indexing mechanism solely determined by their content.  That is not a
subtle difference.  Git _uses_ directories when exporting in order to
simulate a flat namespace.  But it is internally oblivious to their
existence.  And that is a perfectly elegant and reasonable approach
and I like it very much and don't want to mess with it at all.

But I also want to have directories represented within git, because
not doing so leads to awkward problems.  And the proper way as I see
it is _not_ to mess with trees and stick them with "stay when empty"
flags or similar.  This messes up the whole elegance of git's flat
name space.  The proper way is to create a distinct object that
represents a physical directory.  We don't need to represent the
contents of it: those are already tracked in the flat namespace fine,
with trees serving as an implementation detail.

All we need to represent is ".".

So git-ls-files on
.    [dir]
a.c  [file]
a/   [tree]
a/.  [dir]
a/c  [file]
abc  [file]

should likely list

.
a.c
a/.
a/c
abc

If one wants to see the _tree_ because of its SHA1, it may also be
listed.  The SHA1 of a _directory_ like a/., in contrast, is
uninteresting: it will be the same for every directory.

Whether the _tree_ is listed as "a" or "a/" is probably a matter of
taste.  Personally, I think "a/" is better for bringing across the
notion that it is a structuring device not really related to the
physical _directory_ a which is _identical_ (meaning inode-identical,
which is what counts in the physical world) to "a/." even though it is
another name of it.

And using "a/" puts it closer to its natural sort order.

I'd write up a philosophy paper about git's relation between trees,
files, directories if that were not utterly preposterous.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux