Re: Empty directories...

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 17 Jul 2007 23:53:25 -0700

David Kastrup <dak@xxxxxxx> writes:

> Junio C Hamano <gitster@xxxxxxxxx> writes:
>
>> No objections as long as a patch is cleanly made without
>> regression.  It's just nobody agreed that it is "quite serious"
>> yet so far, and no fundamental reason against it.
>
> Thanks.  It certainly is not serious for the Linux kernel source, but
> seems awkward for quite a few situations.  Anyway, what is your take
> on the situation I described?

Didn't I already say that I have no objection to somebody who
wants to track empty directories?  I probably would not do that
myself, but I do not see a reason to forbid it, either.

The right approach to take probably would be to allow entries of
mode 040000 in the index.  Traditionally, we allowed only 100644
(blobs as regular files) and 120000 (blobs as symlinks).  We
recently added 160000 (commit from outer space, aka subproject).

And we do that for all directories, not just empty ones.  So if
you have fileA, empty/, sub/fileB tracked, your index would
probably have these four entries, immediately after read-tree
of an existing tree object:

	100644 15db6f1f27ef7a... 0	fileA
	040000 4b825dc642cb6e... 0	empty
	040000 e125e11d3b63e3... 0	sub
	100644 52054201c2a872... 0	sub/fileB
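
To make the layout above concrete, here is a rough Python sketch
(not git code) that flattens a small nested directory into index
entries, with one 040000 entry per directory.  The helper names
(hash_object, write_tree, read_tree) and the file contents are made
up for illustration, so the blob names will not match the
abbreviated ones above; only the empty tree name is expected to
match 4b825dc642cb6e....  The sketch sorts entries by path, which
is an assumption and gives a different order from the listing.

```python
import hashlib

def hash_object(kind, body):
    """Object name: SHA-1 over "<kind> <size>\\0" followed by the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def write_tree(node):
    """Object name of a directory given as a dict (dict = dir, bytes = file)."""
    items = []
    for name, child in node.items():
        if isinstance(child, dict):
            # Tree objects spell the directory mode "40000" and sort
            # directories as if their name ended with "/".
            items.append(((name + "/").encode(), b"40000",
                          name.encode(), write_tree(child)))
        else:
            items.append((name.encode(), b"100644",
                          name.encode(), hash_object("blob", child)))
    body = b"".join(mode + b" " + name + b"\0" + bytes.fromhex(sha)
                    for _, mode, name, sha in sorted(items))
    return hash_object("tree", body)

def read_tree(node, prefix=""):
    """Flatten a directory into (mode, sha, stage, path) index entries."""
    entries = []
    for name, child in node.items():
        path = prefix + name
        if isinstance(child, dict):
            # Hypothetical: every directory gets its own index entry.
            entries.append(("040000", write_tree(child), 0, path))
            entries.extend(read_tree(child, path + "/"))
        else:
            entries.append(("100644", hash_object("blob", child), 0, path))
    return sorted(entries, key=lambda e: e[3])

# Made-up contents; only the shape mirrors the example above.
example = {"fileA": b"A\n", "empty": {}, "sub": {"fileB": b"B\n"}}
for mode, sha, stage, path in read_tree(example):
    print(f"{mode} {sha[:14]}... {stage}\t{path}")
```

Note that "empty" needs no special casing here: an empty directory
is simply a tree object with a zero-length body.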

Making sure that empty/ directory exists in the working tree is
probably done in entry.c; we have been touching that area in an
unrelated thread in the past few days.

If you add sub/fileC, with "update-index" (and "add"), you
invalidate the SHA-1 object name you stored for "sub" (because
there is no point recomputing the tree object until you know you
need a subtree for "sub" part, which does not happen until the
next "write-tree"), and end up with something like:

	100644 15db6f1f27ef7a... 0	fileA
	040000 4b825dc642cb6e... 0	empty
	040000 00000000000000... 0	sub
	100644 52054201c2a872... 0	sub/fileB
	100644 705bf16c546f32... 0	sub/fileC

These "missing" SHA-1 names would need to be recomputed on demand.
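
The invalidate-then-recompute cycle could look roughly like the
following Python sketch.  It is only an illustration under stated
assumptions: the index is a plain dict from path to (mode, sha),
and add_blob, tree_name and NULL_SHA are hypothetical names, not
anything that exists in git.

```python
import hashlib

NULL_SHA = "0" * 40  # placeholder meaning "tree name not known"

def hash_object(kind, body):
    """Object name: SHA-1 over "<kind> <size>\\0" followed by the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def add_blob(index, path, data):
    """Record a blob and invalidate every directory entry above it."""
    index[path] = ("100644", hash_object("blob", data))
    parts = path.split("/")[:-1]
    for depth in range(1, len(parts) + 1):
        # Cheap: no tree object is computed at this point.
        index["/".join(parts[:depth])] = ("040000", NULL_SHA)

def tree_name(index, dirpath):
    """Return the tree name for dirpath, recomputing it only if invalid."""
    cached = index[dirpath][1]
    if cached != NULL_SHA:
        return cached
    prefix = dirpath + "/"
    items = []
    for path, (mode, sha) in list(index.items()):
        name = path[len(prefix):]
        if not path.startswith(prefix) or "/" in name:
            continue  # not a direct child of dirpath
        if mode == "040000":
            # Directories sort as if their name ended with "/".
            items.append(((name + "/").encode(), b"40000",
                          name.encode(), tree_name(index, path)))
        else:
            items.append((name.encode(), mode.encode(),
                          name.encode(), sha))
    body = b"".join(m + b" " + n + b"\0" + bytes.fromhex(s)
                    for _, m, n, s in sorted(items))
    index[dirpath] = ("040000", hash_object("tree", body))
    return index[dirpath][1]

index = {"empty": ("040000", hash_object("tree", b""))}
add_blob(index, "fileA", b"A\n")
add_blob(index, "sub/fileB", b"B\n")
tree_name(index, "sub")                # what "write-tree" would trigger
add_blob(index, "sub/fileC", b"C\n")   # "sub" is all zeros again
print(index["sub"])
```

The entry for "empty" is never touched by either add_blob call, so
its stored name stays valid throughout.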

We have had the necessary infrastructure to do this "keeping
untouched tree object names in the index" for quite some time,
but it is not a part of the index proper (it is stored in an
extension section in the index file, to keep the index
compatible with older versions of git).

Having made it sound so easy, here are the issues I would expect
to be nontrivial (but probably not rocket surgery either).

 * unpack-trees, which is the workhorse for twoway merge (aka
   "switching branches") and threeway merge, has convoluted
   logic to avoid D/F conflicts; it can probably be cleaned up
   once we do the above conversion so that the index starts
   saying "Hey, I have a directory here" more explicitly.  The
   end result would probably be code that is easier to follow.

 * status, update-index --refresh, and diff-files care about
   the information cached in the index from the last time
   lstat(2) was run on each entry.  What we should store there
   for "tree" entries is very unclear to me, but probably we
   should teach them to ignore the stat-matching logic for
   these entries.

 * diff-index walks the index and a tree in parallel but does
   not currently expect to see a tree object in the index.  It
   needs to be taught to ignore these "tree" entries.

 * merge-recursive and merge-index walk the index, coming up
   with the merge results one path at a time.  They also need to
   be taught to ignore these "tree" entries.

 * diff-index and "read-tree -m" should be taught to take
   advantage of the "tree" entries in the index.  For example,
   if diff-index finds that the "tree" entry in the index and
   the corresponding subtree in the tree object match exactly,
   it does not even have to descend into the tree, which would
   be a huge performance win (you do not have to open the
   subtree and its subtrees on the tree side; you have already
   read everything on the index side, though, so you still have
   to skip the entries in that directory).  "read-tree -m" also
   should be able to optimize two identical subtrees in the 2
   or 3 trees involved.

   Even if we follow the "lazy invalidate" strategy to maintain
   the "tree" entries in the normal codepath, we could have a
   special operation that says "now update all the tree entries
   by recomputing the tree object names as needed".  Perhaps we
   might want to initiate such an operation before "read-tree
   -m" automatically.
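
The diff-index shortcut described in the last item could be
sketched like this in Python.  Everything here is hypothetical: the
dict-based index and object store, the diff_index function, and the
object names ("T_root", "b_A", ...) are stand-ins, not real SHA-1
names.  It also assumes the invariant that a non-zero tree name in
the index is in sync with the index entries below it.

```python
NULL_SHA = "0" * 40  # invalidated tree name; never matches a real one

def diff_index(index, store, root_sha):
    """Compare index blobs against a tree; return (changed, opened).

    index: path -> (mode, sha), with "040000" entries for directories
    store: tree name -> {entry name: (mode, sha)}
    """
    opened, skipped, tree_blobs = [], [], {}

    def walk(tree_sha, prefix):
        opened.append(tree_sha)  # count every tree object we read
        for name, (mode, sha) in store[tree_sha].items():
            path = prefix + name
            if mode != "040000":
                tree_blobs[path] = (mode, sha)
            elif index.get(path) == ("040000", sha):
                # Same tree name on both sides: identical subtree,
                # so do not open it; just remember to skip it.
                skipped.append(path + "/")
            else:
                walk(sha, path + "/")

    walk(root_sha, "")
    # On the index side the entries are already in memory; we only
    # have to step over the ones under a skipped directory.
    index_blobs = {p: e for p, e in index.items()
                   if e[0] != "040000"
                   and not p.startswith(tuple(skipped))}
    changed = sorted(p for p in tree_blobs.keys() | index_blobs.keys()
                     if tree_blobs.get(p) != index_blobs.get(p))
    return changed, opened

store = {
    "T_root": {"fileA": ("100644", "b_A"),
               "empty": ("040000", "T_empty"),
               "sub": ("040000", "T_sub")},
    "T_empty": {},
    "T_sub": {"fileB": ("100644", "b_B")},
}
clean = {"fileA": ("100644", "b_A"), "empty": ("040000", "T_empty"),
         "sub": ("040000", "T_sub"), "sub/fileB": ("100644", "b_B")}
dirty = dict(clean, **{"sub": ("040000", NULL_SHA),
                       "sub/fileC": ("100644", "b_C")})
print(diff_index(clean, store, "T_root"))
print(diff_index(dirty, store, "T_root"))
```

With the clean index only the top-level tree is opened; with "sub"
invalidated the walk has to descend into it, which is why updating
the tree entries before "read-tree -m" may pay off.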
