Re: [PATCH v10 18/36] merge-recursive: add get_directory_renames()

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Sat, 12 Oct 2019 21:23:58 +0200 (CEST)

Hi Elijah,

On Fri, 11 Oct 2019, Elijah Newren wrote:

> On Wed, Oct 9, 2019 at 1:39 PM Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
> >
> > sorry about the blast from the past, but I just stumbled over something
> > I could not even find any discussion about:
>
> I'm curious what brought you to this part of the codebase, but either
> way, thanks for sending an email with your findings.

Well, you know, it's a loooooong story.

> More comments below...

Thank you so much, they unpuzzled me quite a bit.

Ciao,
Dscho

>
> [...]
> > > @@ -1357,6 +1395,169 @@ static struct diff_queue_struct *get_diffpairs(struct merge_options *o,
> > >       return ret;
> > >  }
> > >
> > > +static void get_renamed_dir_portion(const char *old_path, const char *new_path,
> > > +                                 char **old_dir, char **new_dir)
> > > +{
> > > +     char *end_of_old, *end_of_new;
> > > +     int old_len, new_len;
> > > +
> > > +     *old_dir = NULL;
> > > +     *new_dir = NULL;
> > > +
> > > +     /*
> > > +      * For
> > > +      *    "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
> > > +      * the "e/foo.c" part is the same, we just want to know that
> > > +      *    "a/b/c/d" was renamed to "a/b/some/thing/else"
> > > +      * so, for this example, this function returns "a/b/c/d" in
> > > +      * *old_dir and "a/b/some/thing/else" in *new_dir.
> > > +      *
> > > +      * Also, if the basename of the file changed, we don't care.  We
> > > +      * want to know which portion of the directory, if any, changed.
> > > +      */
> > > +     end_of_old = strrchr(old_path, '/');
> > > +     end_of_new = strrchr(new_path, '/');
> > > +
> > > +     if (end_of_old == NULL || end_of_new == NULL)
> > > +             return;
> > > +     while (*--end_of_new == *--end_of_old &&
> > > +            end_of_old != old_path &&
> > > +            end_of_new != new_path)
> > > +             ; /* Do nothing; all in the while loop */
> > > +     /*
> > > +      * We've found the first non-matching character in the directory
> > > +      * paths.  That means the current directory we were comparing
> > > +      * represents the rename.  Move end_of_old and end_of_new back
> > > +      * to the full directory name.
> > > +      */
> > > +     if (*end_of_old == '/')
> > > +             end_of_old++;
> > > +     if (*end_of_old != '/')
> > > +             end_of_new++;
> >
> > Is this intentional? Even after thinking about it for fifteen minutes, I
> > think it was probable meant to test for `*end_of_new == '/'` instead of
> > `*end_of_old != '/'`. And...
>
> Yeah, looks like a mess-up, and yes your suspicion is correct about
> what was intended.
>
> Hilariously, though, no bug results from this.  Since these are paths,
> as canonicalized by git (i.e. not as specified by the user where they
> might accidentally type multiple consecutive slashes), there will
> never be two slashes in a row (because we can't have directories with
> an empty name).  Thus, it is guaranteed at this point that *end_of_old
> != '/', and end_of_new is thus unconditionally advanced.  Further,
> since we wanted to find the _next_ '/' character after end_of_new,
> then there were two cases: (1) end_of_new already pointed at a slash
> character in which case we needed it to be advanced, or (2) end_of_new
> didn't point to a slash character so it wouldn't hurt at all to
> advance it.
>
> > > +     end_of_old = strchr(end_of_old, '/');
> > > +     end_of_new = strchr(end_of_new, '/');
> >
> > ... while I satisfied myself that these calls cannot return `NULL` at
> > this point, it took quite a few minutes of reasoning.
> >
> > So I think we might want to rewrite these past 6 lines, to make
> > everything quite a bit more obvious, like this:
> >
> >         if (end_of_old != old_path)
> >                 while (*(++end_of_old) != '/')
> >                         ; /* keep looking */
> >         if (end_of_new != new_path)
> >                 while (*(++end_of_new) != '/')
> >                         ; /* keep looking */
>
> I think your if-checks here are not correct.  Let's say that old_path
> was "tar/foo.c" and new_path was "star/foo.c".  The initial strrchr
> will bring both end_of_* variables back to the slash.  The moving left
> while equal will move end_of_old back to old_path (i.e. pointing to
> the "t") and end_of_new back to pointing at "t" as well.  Here's where
> your six alternate lines would kick in, and would leave end_of_old at
> old_path, while moving end_of_new to the '/', making it look like we
> had a rename of "" (the empty string or root directory) to "star"
> instead of a rename of "tar" to "star".  If you dropped your if-checks
> (just having the while loops), then I think it does the right thing.
>
> > There is _still_ one thing that makes this harder than trivial to reason
> > about: the case where one of `*end_of_old` and `*end_of_new` is a slash.
> > At this point, we assume that `*end_of_old != *end_of_new` (more about
> > that assumption in the next paragraph), therefore only one of them can
> > be a slash, and we want to advance beyond it. But even if the pointer
> > does not point at a slash, we want to look for one, so we want to
> > advance beyond it.
>
> I should probably add a comment that we want to advance BOTH to the
> next slash.  I would have just used strchr() but it wouldn't advance
> the string if it already points to what I'm looking for.  Actually, I
> guess I could simplify the code by unconditionally advancing by one
> character, then calling strchr().  In other words, simplifying these
> six lines to just
>
>        end_of_old = strchr(++end_of_old, '/');
>        end_of_new = strchr(++end_of_new, '/');
>
> > I also think that we need an extra guard: we do not handle the case
> > `a/b/c` -> `a/b/d` well. As stated a few lines above, "if the basename
> > of the file changed, we don't care". So we start looking at the last
> > slash, then go backwards, and since everything matches, end up with
> > `end_of_old == old_path` and `end_of_new == new_path`. The current code
> > will advance `end_of_new` (which I think is wrong) and then looks for
> > the next slash in both `end_of_new` and `end_of_old` (which is also
> > wrong).
>
> The current code is slightly convoluted, but I would say it's not
> wrong for this case.  If we renamed a/b/c -> a/b/d, then there isn't a
> directory rename; the leading directory (a/b/) is the same for both.
> You are right that the advancing of end_of_old and end_of_new to the
> next slash would result in what looks like a rename of "a" to "a", but
> the checks at the end checked for this case and only returned
> something for *old_dir and *new_dir if these didn't match; in fact,
> it's the part of the code at the end of your email that you didn't
> comment on, here:
>
> > > +
> > > +     /*
> > > +      * It may have been the case that old_path and new_path were the same
> > > +      * directory all along.  Don't claim a rename if they're the same.
> > > +      */
> > > +     old_len = end_of_old - old_path;
> > > +     new_len = end_of_new - new_path;
> > > +
> > > +     if (old_len != new_len || strncmp(old_path, new_path, old_len)) {
> > > +             *old_dir = xstrndup(old_path, old_len);
> > > +             *new_dir = xstrndup(new_path, new_len);
> > > +     }
> > > +}
> > > [...]
>
> However, we could drop this late check by just doing a simpler earlier
> check to see if end_of_old == old_path and end_of_new == new_path
> after the "find first non-equal character" step and before advancing
> to the next '/', and if that condition is found, then return early
> with no match.
>
>
> However, since you highlighted this code, there are two other special
> cases that might be interesting:
>
> 1) What if we are renaming e.g. foo/bar/baz.c ->
> leading/dir/foo/bar/baz.c?  Then after trying to find the first
> non-matching char we'll have end_of_old == old_path and *end_of_new ==
> 'f', and the advancing makes it look like "foo" being renamed to
> "leading/dir/foo".  Since the root directory cannot be renamed (it
> always exists on both sides of history), this probably makes sense as
> the right thing to return.
> 2) What if the renaming went the other way, from
> leading/dir/foo/bar/baz.c -> foo/bar/baz.c?  The whole advancing thing
> makes this look like "leading/dir/foo" being renamed to "foo", instead
> of "leading/dir" being renamed to "" (the root directory).  If we
> don't detect it as "leading/dir" being renamed (merged into) the root
> directory, then new files added directly within leading/dir/ on the
> other side of history won't be moved by directory rename detection
> into the root directory.
>
> > Is my reading correct?
>
> I'm not sure if I've answered your questions; let me know if not.  I
> have generated a couple patches to (1) make the code easier to follow,
> and (2) support the rename/merge of a subdirectory into the root
> directory.  They're waiting for the gitgitgadget CI checks right now,
> then I'll send them to the list.
>