Re: [PATCH 02/10] unpack-trees: make sparse aware

Elijah Newren <newren@xxxxxxxxx> · Wed, 21 Apr 2021 09:11:37 -0700

// Adding Matheus to cc due to the ignore_skip_worktree bit, given his
experience and expertise with the checkout and unpack-trees code.

On Wed, Apr 21, 2021 at 6:41 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 4/20/2021 7:00 PM, Elijah Newren wrote:
> > On Tue, Apr 13, 2021 at 7:01 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@xxxxxxxxx> wrote:
> >> diff --git a/dir.h b/dir.h
> >> index 51cb0e217247..9d6666f520f3 100644
> >> --- a/dir.h
> >> +++ b/dir.h
> >> @@ -503,7 +503,7 @@ static inline int ce_path_match(struct index_state *istate,
> >>                                 char *seen)
> >>  {
> >>         return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
> >> -                             S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
> >> +                             S_ISSPARSEDIR(ce->ce_mode) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
> >
> > I'm confused why this change would be needed, or why it'd semantically
> > be meaningful here either.  Doesn't S_ISSPARSEDIR() being true imply
> > S_ISDIR() is true (and perhaps even vice versa?).
> >
> > By chance, was this a leftover from your early RFC changes from a few
> > series ago when you had an entirely different mode for sparse
> > directory entries?
>
> I will double-check on this with additional testing and debugging.
> Your comments below make it clear that this patch would benefit from
> some additional splitting.
>
> >>  }
> >>
> >>  static inline int dir_path_match(struct index_state *istate,
> >> diff --git a/preload-index.c b/preload-index.c
> >> index e5529a586366..35e67057ca9b 100644
> >> --- a/preload-index.c
> >> +++ b/preload-index.c
> >> @@ -55,6 +55,8 @@ static void *preload_thread(void *_data)
> >>                         continue;
> >>                 if (S_ISGITLINK(ce->ce_mode))
> >>                         continue;
> >> +               if (S_ISSPARSEDIR(ce->ce_mode))
> >> +                       continue;
> >>                 if (ce_uptodate(ce))
> >>                         continue;
> >>                 if (ce_skip_worktree(ce))
> >
> > Don't we have S_ISSPARSEDIR(ce->ce_mode) implies ce_skip_worktree(ce)?
> >  Is this a duplicate check?  If so, is it still desirable for
> > future-proofing or code clarity, or is it strictly redundant?
>
> You're right, we could skip this one because the ce_skip_worktree(ce)
> is enough to cover this case. I think I created this one because I was
> auditing uses of S_ISGITLINK().
>
> >> diff --git a/read-cache.c b/read-cache.c
> >> index 29ffa9ac5db9..6308234b4838 100644
> >> --- a/read-cache.c
> >> +++ b/read-cache.c
> >> @@ -1594,6 +1594,9 @@ int refresh_index(struct index_state *istate, unsigned int flags,
> >>                 if (ignore_skip_worktree && ce_skip_worktree(ce))
> >>                         continue;
> >>
> >> +               if (istate->sparse_index && S_ISSPARSEDIR(ce->ce_mode))
> >> +                       continue;
> >> +
> >
> > I'm a bit confused about what could trigger ce_skip_worktree(ce) &&
> > !ignore_skip_worktree and why it'd be desirable to refresh
> > skip-worktree entries.  However, this is tangential to your patch and
> > has apparently been around since 2009 (in particular, from 56cac48c35
> > ("ie_match_stat(): do not ignore skip-worktree bit with
> > CE_MATCH_IGNORE_VALID", 2009-12-14)).
>
> This is probably better served with a statement like this earlier in
> the method:
>
>         if (ignore_skip_worktree)
>                 ensure_full_index(istate);
>
> It seems like ignoring the skip worktree bits is a rare occasion and
> it will be worth expanding the index for that case.

Maybe...I read the commit message that introduced the behavior and
it's not very convincing to me that SKIP_WORKTREE should be ignored
(it's also not that clear to me what the conditions are; is it just
update-index --really-refresh?); it may be worth double checking on
that assumption first, especially given how many other bugs existed
with skip_worktree stuff for years.  If it's necessary, then I agree
that your extra if-check makes sense.

In particular, I think it'd be really dumb for "update-index
--really-refresh" to read in and populate a huge subdirectory just to
stat files that don't exist because they are in directories that don't
exist.  And I think there's a pretty good argument to not update stat
information for skip_worktree entries in non-sparse-index cases even
in the presence of that flag, especially given Matheus' other recent
changes in this area (the emails just before we got to the point of
discussing SKIP_WORKTREE and racy clean entries...speaking of which,
it might be worthwhile pinging Matheus' for opinions on this issue
too.)

> >>                 if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
> >>                         filtered = 1;
> >>
> >> diff --git a/unpack-trees.c b/unpack-trees.c
> >> index dddf106d5bd4..9a62e823928a 100644
> >> --- a/unpack-trees.c
> >> +++ b/unpack-trees.c
> >> @@ -586,6 +586,13 @@ static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o)
> >>  {
> >>         ce->ce_flags |= CE_UNPACKED;
> >>
> >> +       /*
> >> +        * If this is a sparse directory, don't advance cache_bottom.
> >> +        * That will be advanced later using the cache-tree data.
> >> +        */
> >> +       if (S_ISSPARSEDIR(ce->ce_mode))
> >> +               return;
> >> +
> >
> > I don't understand cache_bottom stuff; we might want to get Junio to
> > look over it.  Or maybe I just need to dig a bit further and attempt
> > to understand it.
>
> I remember looking very careful at this when I created this (and found
> it worth a comment) but I don't recall enough off the top of my head.
> This is worth splitting out with a careful message, which will force me
> to reexamine the cache_bottom member.
>
> >>         if (o->cache_bottom < o->src_index->cache_nr &&
> >>             o->src_index->cache[o->cache_bottom] == ce) {
> >>                 int bottom = o->cache_bottom;
> >> @@ -984,6 +991,9 @@ static int do_compare_entry(const struct cache_entry *ce,
> >>         ce_len -= pathlen;
> >>         ce_name = ce->name + pathlen;
> >>
> >> +       /* remove directory separator if a sparse directory entry */
> >> +       if (S_ISSPARSEDIR(ce->ce_mode))
> >> +               ce_len--;
> >>         return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
> >
> > Shouldn't we be passing ce->ce_mode instead of S_IFREG here as well?
> >
> > Note the following sort order:
> >    foo
> >    foo.txt
> >    foo/
> >    foo/bar
> >
> > You've trimmed off the '/', so 'foo/' would be ordered where 'foo' is,
> > but df_name_compare() exists to make "foo" sort exactly where "foo/"
> > would when "foo" is a directory.  Will your df_name_compare() call
> > here result in foo.txt being placed after all the "foo/<subpath>"
> > entries in the index and perhaps cause other problems down the line?
> > (Are there issues, e.g. with cache-trees getting wrong ordering from
> > this, or even writing out indexes or tree objects with the wrong
> > ordering?  I've written out trees to disk with wrong ordering before
> > and git usually survives but gets really confused with diffs.)
> >
> > Since at least one caller of compare_entry() takes the return result
> > and does a "if (cmp < 0)", this order is going to matter in some
> > cases.  Perhaps we need some testcases where there is a sparse
> > directory entry named "foo/" and a file recorded in some relevant tree
> > with the name "foo.txt" to be able to trigger these lines of code?
>
> I will do some testing to find out why removing the separator here was
> necessary or valuable.

I think you removed the separator because df_name_compare() assumes it
gets a regular filename (i.e. no trailing '/') and manually adds one
based on mode for directories.  You were probably worried about what
amounts to a non-sensical double '/', but df_name_compare() wouldn't
actually get to that point unless someone somehow recorded a path
within a git tree object that ended with a trailing '/'.  I'd rather
not have to worry about the double '/' and explain why it isn't
possible (or wonder about whether git trees with trailing '/'
characters could be recorded on some OS), so I think the trimming of
the separator as you did makes sense.

What doesn't make sense to me is that the code just below had a
hardcoded S_IFREG that it passed to df_name_compare, based on "this is
a cache entry, and index entries are _always_ regular files".  You
didn't change that, even though it's now a false assumption.
symlinks, and regular files should be passed as S_IFREG there, I'm not
sure what should be passed for submodules (though the fact that it's
been using S_IFREG for years suggests maybe that is the mode we want
for it, so we can't use ce->ce_mode), and I'm pretty sure sparse
directory entries should be passed as S_IFDIR in order to get the
sorting right unless you stop stripping the trailing '/' character.
I'm not exactly sure where the sorting for do_compare_entry() affects
the code later, but I tried to trace it out a little in my comments
above in order to guide some testing.

> >>  }
> >>
> >> @@ -993,6 +1003,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
> >>         if (cmp)
> >>                 return cmp;
> >>
> >> +       /* If ce is a sparse directory, then allow equality here. */
> >> +       if (S_ISSPARSEDIR(ce->ce_mode))
> >> +               return 0;
> >> +
> >
> > Um...so a sparse directory compares equal to _anything_ at all?  I'm
> > really confused why this would be desirable.  Am I missing something
> > here?
>
> The context is that is removed from the patch is that "cmp" is the
> response from do_compare_entry(), which does a length-limited comparison.
> If cmp is non-zero, then we've already returned the difference.
>
> The rest of the method is checking if the 'info' input is actually a
> parent directory of the _path_ given at this cache entry.

Ah, thanks for the explanation.  So the only way we get here with
cmp==0 when we're dealing with a sparse directory entry is if we found
a directory by the same name....

> >>         /*
> >>          * Even if the beginning compared identically, the ce should
> >>          * compare as bigger than a directory leading up to it!
>
> The line after this is:
>
>         return ce_namelen(ce) > traverse_path_len(info, tree_entry_len(n));
>
> This comparison is saying "these paths match up to the directory specified
> by info and n, but we need 'ce' to be a file within that directory." But
> in the case of a sparse directory entry, we can skip this comparison.

Isn't this "must skip" rather than "can skip"?  If we're considering
the ce path "foo/bar/", then the traverse_path would be "foo/bar" and
we'd have:
    ce_namelen(ce) == 1 + traverse_path_len(info, tree_entry_len(n))
so this would return 1 for the comparison making them be treated as
non-equal even though they are what we consider equal entries.

In any event, it seems like this new check could use a better comment
than "then allow equality here".

> >> @@ -1243,6 +1257,7 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> >>         struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
> >>         struct unpack_trees_options *o = info->data;
> >>         const struct name_entry *p = names;
> >> +       unsigned recurse = 1;
> >
> > "recurse" sent my mind off into questions about safety checks, base
> > cases, etc., instead of just the simple "we don't want to read in
> > directories corresponding to sparse entries".  I think this would be
> > clearer either if the variable had the sparsity concept embedded in
> > its name somewhere (e.g. "unsigned sparse_entry = 0", and check for
> > (!sparse_entry) instead of (recurse) below), or with a comment about
> > why there are cases where you want to avoid recursion.
>
> I can understand that. This callback is confusing because it _does_
> recurse, but through a sequence of methods instead of actually calling
> itself.
>
> It would be better to say something like "unpack_subdirectories = 1"
> and disabling it when we are in a sparse directory.

I like that name.

>
> >>
> >>         /* Find first entry with a real name (we could use "mask" too) */
> >>         while (!p->mode)
> >> @@ -1284,12 +1299,16 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> >>                                         }
> >>                                 }
> >>                                 src[0] = ce;
> >> +
> >> +                               if (S_ISSPARSEDIR(ce->ce_mode))
> >> +                                       recurse = 0;
> >
> > Ah, the context here doesn't show it but this is in the "if (!cmp)"
> > block, i.e. if we found a match for the sparse directory.  This makes
> > sense, to me, _if_ we ignore the above question about sparse
> > directories matching equal to anything and everything.
>
> I believe that "anything and everything" concern has been resolved.

Yes, if we just improve the "then allow equality here" comment.

> >> @@ -1319,7 +1338,8 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
> >>                         }
> >>                 }
> >>
> >> -               if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
> >> +               if (recurse &&
> >> +                   traverse_trees_recursive(n, dirmask, mask & ~dirmask,
> >>                                              names, info) < 0)
> >>                         return -1;
> >>                 return mask;
> >
> > Nice.  :-)
> >
> >
> > I think your patch was mostly about the recurse stuff, which other
> > than the name or a comment about it look good to me.  However, all the
> > other preparatory small tweaks brought up a lot of questions or
> > confusion for me.  I'm worried there might be a bug or two, though I
> > may have just misunderstood some of the code bits.
>
> This patch could probably be split up a little to make these things
> clearer. Thanks for bringing up the tricky bits.
>
> -Stolee