Re: Empty directories...

David Kastrup <dak@xxxxxxx> · Sun, 22 Jul 2007 23:35:10 +0200

Coming full circle...

Junio C Hamano <gitster@xxxxxxxxx> writes:

> The right approach to take probably would be to allow entries of
> mode 040000 in the index.  Traditionally, we allowed only 100644
> (blobs as regular files) and 120000 (blobs as symlinks).  We
> recently added 160000 (commit from outer space, aka subproject).
>
> And we do that for all directories, not just empty ones.  So if
> you have fileA, empty/, sub/fileB tracked, your index would
> probably have these four entries, immediately after read-tree
> of an existing tree object:
>
> 	100644 15db6f1f27ef7a... 0	fileA
> 	040000 4b825dc642cb6e... 0	empty
> 	040000 e125e11d3b63e3... 0	sub
> 	100644 52054201c2a872... 0	sub/fileB

This would be very much what I am proposing now, except that instead
of 040000 we would have 040755 usually, so that when the index makes
it into the repository where 040000 already has a meaning (a
disappear-when-empty tree) we get the right information.  Also note
that the above comes about when doing

    git-add *

but not when doing

    git-add fileA empty sub/fileB

(in the latter case, the entry for sub would be missing).
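
As a side note on the quoted listing: the name shown for "empty", 4b825dc642cb6e..., is the well-known object name of the empty tree. A minimal Python sketch of how such a name is derived follows; the helper name git_object_name is made up for this illustration and is not a git API.

```python
import hashlib

def git_object_name(obj_type: str, payload: bytes) -> str:
    # git hashes a header of the form "<type> <size>" followed by a NUL byte,
    # then the object payload itself.
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

# A tree with no entries has an empty payload.
empty_tree = git_object_name("tree", b"")
print(empty_tree)
```

Because every empty directory hashes to this same name, the name alone says nothing about whether the directory should disappear when empty; that information would have to live in the mode field, which is what the 040000 versus 040755 distinction is about.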

> If you add sub/fileC, with "update-index" (and "add"), you
> invalidate the SHA-1 object name you stored for "sub" (because
> there is no point recomputing the tree object until you know you
> need a subtree for "sub" part, which does not happen until the
> next "write-tree"), and end up with something like:
>
> 	100644 15db6f1f27ef7a... 0	fileA
> 	040000 4b825dc642cb6e... 0	empty
> 	040000 00000000000000... 0	sub
> 	100644 52054201c2a872... 0	sub/fileB
> 	100644 705bf16c546f32... 0	sub/fileC
>
> These "missing" SHA-1 would need to be recomputed on-demand.

Ah, ok.  Does it even make sense to compute the SHA-1 values in the
index in advance?  What would they be useful for?
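
To make the quoted "lazy invalidate" step concrete, here is a rough Python sketch. It assumes the index can be modelled as a plain dict from path to (mode, name); that model, the function add_path and the NULL_SHA1 constant are inventions for this illustration and do not reflect git's actual data structures.

```python
NULL_SHA1 = "0" * 40  # stands in for "tree name not currently known"

def add_path(index: dict, path: str, blob_name: str) -> None:
    # Record the blob, then invalidate the cached name of every enclosing
    # directory; nothing is recomputed until someone actually needs it.
    index[path] = ("100644", blob_name)
    parent = path.rpartition("/")[0]
    while parent:
        index[parent] = ("040000", NULL_SHA1)
        parent = parent.rpartition("/")[0]

# State right after read-tree, using the truncated names from the quoted text.
index = {
    "fileA": ("100644", "15db6f1f27ef7a..."),
    "empty": ("040000", "4b825dc642cb6e..."),
    "sub": ("040000", "e125e11d3b63e3..."),
    "sub/fileB": ("100644", "52054201c2a872..."),
}

add_path(index, "sub/fileC", "705bf16c546f32...")

for path in sorted(index):
    mode, name = index[path]
    print(mode, name, 0, path)
```

Only "sub" loses its name here; "empty" keeps its cached value because nothing underneath it changed.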

> We have had necessary infrastructure to do this "keeping
> untouched tree object names in the index" for quite some time,
> but it is not a part of the index proper (it is stored in an
> extension section in the index file, to keep the index
> compatible with older versions of git).

What is the application for which this is being used?

> Having made it sound so easy, here are the issues I would expect
> to be nontrivial (but probably not rocket surgery either).
>
>  * unpack-trees, which is the workhorse for twoway merge (aka
>    "switching branches") and threeway merge, has a convoluted
>    logic to avoid D/F conflicts; it can probably be cleaned up
>    once we do the above conversion so that the index starts
>    saying "Hey, I have a directory here" more explicitly.  The
>    end result would probably be a code easier to follow.

I am afraid that this is unlikely to happen, and that is because
directory tracking remains optional at a fundamental level as long as
we want to support the current behavior as an option.  However, one
could conceivably add 040000 entries (rather than 040755) for
directories that have not been passed into tracking but are required
by git, if this simplifies matters.  But it sounds like something that
might complicate working with several different git versions on the
same index.

>  * status, update-index --refresh, and diff-files care about
>    the information cached in the index from the last time
>    lstat(2) is run on each entry.  What we should store there
>    for "tree" entries is very unclear to me, but probably we
>    should teach them to ignore the stat-matching logic for
>    these entries.

At present, git tracks just the u+x bit for normal files, and for
directories there is really nothing worth tracking as long as no
attempt is made to restore more mode bits.  Modification times are
probably a bit too risky to pay attention to.
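
A small Python sketch of what "tracking just the u+x bit" amounts to: the function name tracked_mode is hypothetical, and the directory branch reflects the 040000 proposal from the quoted text (040755 in the variant suggested above), not anything git does today.

```python
import stat

def tracked_mode(st_mode: int) -> int:
    # Symlinks are recorded with one fixed mode.
    if stat.S_ISLNK(st_mode):
        return 0o120000
    # Assumption: directory entries as proposed in this thread; no permission
    # bits from lstat(2) are carried over.
    if stat.S_ISDIR(st_mode):
        return 0o040000
    # Regular files: only the owner-execute bit survives.
    return 0o100755 if st_mode & stat.S_IXUSR else 0o100644

print(oct(tracked_mode(0o100664)))
print(oct(tracked_mode(0o100700)))
```

Under this model two directories with different permissions map to the same recorded mode, so there is nothing for the stat-matching logic to compare.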

>  * diff-index walks the index and a tree in parallel but does
>    not currently expect to see a tree object in the index.  It
>    needs to be taught to ignore these "tree" entries.

Or do something sensible when comparing.  Understood.

>  * merge-recursive and merge-index walk the index, coming up
>    with the merge results one path at a time.  They also need to
>    be taught to ignore these "tree" entries.

Same here.

>  * diff-index and "read-tree -m" should be taught to take
>    advantage of the "tree" entries in the index.  For example,
>    if diff-index finds the "tree" entry in the index and the
>    subtree found from the tree object exactly match, it does not
>    even have to descend into the tree, which would be a huge
>    performance win (because you do not have to open the subtree
>    and its subtrees from the tree side; you already have read
>    everything on the index side, and still have to skip the
>    entries in the directory).  "read-tree -m" also should be
>    able to optimize two identical subtrees in the 2 or 3 trees
>    involved.
>
>    Even if we follow the "lazy invalidate" strategy to maintain
>    the "tree" entries in the normal codepath, we could have a
>    special operation that says "now update all the tree entries
>    by recomputing the tree object names as needed".  Perhaps we
>    might want to initiate such an operation before "read-tree
>    -m" automatically.

Over my head, but it would appear that it can safely be left for later.
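
For what it is worth, the gist of the quoted optimization seems to be "equal tree names mean equal contents, so do not descend". A toy Python sketch under that reading follows; the Tree class, the function differing_paths and all object names in it are placeholders invented for illustration, not git internals.

```python
class Tree:
    def __init__(self, name, entries):
        self.name = name        # object name of this tree (placeholder string)
        self.entries = entries  # path component -> blob name (str) or Tree

def differing_paths(a, b, prefix="", descended=None):
    if descended is None:
        descended = []
    if a.name == b.name:
        return []  # identical subtree: no need to open it at all
    descended.append(prefix or ".")
    changed = []
    for key in sorted(set(a.entries) | set(b.entries)):
        x, y = a.entries.get(key), b.entries.get(key)
        path = prefix + key
        if isinstance(x, Tree) and isinstance(y, Tree):
            changed += differing_paths(x, y, path + "/", descended)
        elif x != y:
            changed.append(path)
    return changed

old = Tree("root-old", {
    "fileA": "blob-a",
    "big": Tree("tree-big", {"f1": "blob-1", "f2": "blob-2"}),
    "sub": Tree("sub-old", {"fileB": "blob-b"}),
})
new = Tree("root-new", {
    "fileA": "blob-a",
    "big": Tree("tree-big", {"f1": "blob-1", "f2": "blob-2"}),
    "sub": Tree("sub-new", {"fileB": "blob-b", "fileC": "blob-c"}),
})

descended = []
changed = differing_paths(old, new, descended=descended)
print(changed, descended)
```

The "big" subtree is never opened because both sides carry the same name for it, which is where the performance win would come from.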

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum
