Re: [PATCH v7 19/31] merge-recursive: add get_directory_renames()

Stefan Beller <sbeller@xxxxxxxxxx> · Mon, 5 Feb 2018 11:39:43 -0800

>     /*
>      * For
>      *    "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
>      * the "e/foo.c" part is the same, we just want to know that
>      *    "a/b/c/d" was renamed to "a/b/some/thing/else"
>      * so, for this example, this function returns "a/b/c/d" in
>      * *old_dir and "a/b/some/thing/else" in *new_dir.
>      *
>      * Also, if the basename of the file changed, we don't care.  We
>      * want to know which portion of the directory, if any, changed.
>      */
>
> Is that better?

yes, that helps me understanding pretty much.

>
>>> +        *
>>> +        * Also, if the basename of the file changed, we don't care.  We
>>> +        * want to know which portion of the directory, if any, changed.
>>> +        */
>>> +       end_of_old = strrchr(old_path, '/');
>>> +       end_of_new = strrchr(new_path, '/');
>>> +
>>> +       if (end_of_old == NULL || end_of_new == NULL)
>>> +               return;
>>
>> return early as the files are in the top level, and apparently we do
>> not support top level renaming?
>>
>> What about a commit like 81b50f3ce4 (Move 'builtin-*' into
>> a 'builtin/' subdirectory, 2010-02-22) ?
>>
>> Well that specific commit left many files outside the new builtin dir,
>> but conceptually we could see a directory rename of
>>
>>     /* => /src/*
>
> We had some internal usecases for want to support a "rename" of the
> toplevel directory into a subdirectory of itself.  However, attempting
> to support that opened much too big a can of worms for me.  We'd open
> up some big surprises somewhere.

ok, I was just trying to understand the code.
(As said before, I was not asking for having this implemented.)

> In particular, note that not supporting a "rename" of the toplevel
> directory is a special case of not supporting a "rename" of any
> directory to a subdirectory below itself, which in turn is a very
> special case of excluding partial directory renames.  I addressed this
> in the cover letter of my original submission, as follows:
>
> """
> Further, there's a basic question about when directory rename detection
> should be applied at all.  I have a simple rule:
>
>   3) If a given directory still exists on both sides of a merge, we do
>      not consider it to have been renamed.
>
> Rule 3 may sound obvious at first, but it will probably arise as a
> question for some users -- what if someone "mostly" moved a directory but
> still left some files around, or, equivalently (from the perspective of the
> three-way merge that merge-recursive performs), fully renamed a directory
> in one commmit and then recreated that directory in a later commit adding
> some new files and then tried to merge?  See the big comment in section 4
> of the new t6043 for further discussion of this rule.
> """
>
> Patch 04/31 is the one that adds that big comment with further discussion.
>
> Maybe there's a way to support toplevel renames, but I think it'd make
> this series longer and more complicated...and may cause more edge
> cases that confuse users.

Thanks for reminding!

>
>>> +       while (*--end_of_new == *--end_of_old &&
>>> +              end_of_old != old_path &&
>>> +              end_of_new != new_path)
>>> +               ; /* Do nothing; all in the while loop */
>>
>> We have to compare manually as we'd want to find
>> the first non-equal and there doesn't seem to be a good
>> library function for that.
>>
>> Assuming many repos are UTF8 (including in their paths),
>> how does this work with display characters longer than one char?
>> It should be fine as we cut at the slash?
>
> Oh, UTF-8.  Ugh.
> Can UTF-8 characters, other than '/', have a byte whose value matches
> (unsigned char)('/')?  If so, then I'll need to figure out how to do
> utf-8 character parsing.  Anyone have pointers?

Oh right we are always cutting at '/', which hex 2F, so we cannot split
a codepoint in half accidentally (finding the slash inside a codepoint).

And renaming a directory from one utf8 codepoint to another, which
have the same prefix is not an issue at this layer, too.
(Think of abc -> abd just all of abc/d in one code point, but there
but even for multi code points/ASCII we repeat the prefix when
printing the rename)

>> So the first loop is about counting the number of files in each directory
>> that are renamed and the later while loop is about mapping them?
>
> Close; would adding the following comment at the top of the function help?
>
>     /*
>      * Typically, we think of a directory rename as all files from a
>      * certain directory being moved to a target directory.  However,
>      * what if someone first moved two files from the original
>      * directory in one commit, and then renamed the directory
>      * somewhere else in a later commit?  At merge time, we just know
>      * that files from the original directory went to two different
>      * places, and that the bulk of them ended up in the same place.
>      * We want each directory rename to follow the bulk of the files
>      * from that directory.  This function exists to find where the
>      * bulk of the files went.
>      *
>      * The first loop below simply iterates through the list of
>      * renames, adding up all the rename source->target locations with
>      * a count.
>      *
>      * The second loop compares the count for each renamed directory
>      * and declares the "winning" target location.
>      */
>
> Is there any part that remains unclear with that comment?  (Also, is
> that comment too long?)

Sounds good to me. Thanks!

>
>>> +               /* Strings were xstrndup'ed before inserting into string-list,
>>> +                * so ask string_list to remove the entries for us.
>>> +                */
>>
>> comment style.
>
> Thanks.
>
>>> +               entry->possible_new_dirs.strdup_strings = 1;
>>
>> Why do we need to set strdup_strings here (so late in the
>> game, we are about to clear it?) Could we set it earlier?
>>
>> Or rather have the string list duplicate the strings instead of
>> get_renamed_dir_portion ?
>
> We didn't strdup the original string (a file path) as-is, we
> strndup'ed a subset of the original string (just the relevant portion
> of the directory).  Since we already had to allocate a special string
> for that, making the string list duplicate the strings would have
> caused an extra unnecessary allocation and required us to free the
> memory allocated by get_renamed_dir_portion manually.  So, we do need
> this here.
>
> Does this comment make it clearer?:
>
>         /*
>          * The relevant directory sub-portion of the original full
>          * filepaths were xstrndup'ed before inserting into
>          * possible_new_dirs, and instead of manually iterating the
>          * list and free'ing each, just lie and tell
>          * possible_new_dirs that it did the strdup'ing so that it
>          * will free them for us.
>          */
>         entry->possible_new_dirs.strdup_strings = 1;
>         string_list_clear(&entry->possible_new_dirs, 1);

Phrased this way, it makes sense.

Thanks,
Stefan