Re: [PATCH 1/5] merge-ort: replace string_list_df_name_compare with faster alternative

Elijah Newren <newren@xxxxxxxxx> · Thu, 27 May 2021 21:10:27 -0700

On Thu, May 27, 2021 at 6:32 PM Taylor Blau <me@xxxxxxxxxxxx> wrote:
>
> On Thu, May 27, 2021 at 08:37:17AM +0000, Elijah Newren via GitGitGadget wrote:
> > -static int string_list_df_name_compare(const char *one, const char *two)
> > +static int sort_dirs_next_to_their_children(const void *a, const void *b)
>
> I looked at the new implementation of this function (and
> df_name_compare() to double check it) and am convinced that it's
> correctness, with the exception of one question.
>
> Some thoughts I had while trying to make sure this was all OK:
>
> > +     const char *one = ((struct string_list_item *)a)->string;
> > +     const char *two = ((struct string_list_item *)b)->string;
> > +     unsigned char c1, c2;
> > +
> > +     while (*one && (*one == *two)) {
> > +             one++;
> > +             two++;
> > +     }
>
> Advancing 'one' and 'two' to point at either the end of 'a' (and the
> same position within 'b'), or the first place where the two have
> different characters. If 'b' is shorter than 'a', '*one != *two' will
> terminate the loop (since '*two' will be NUL, and '*one' will not).
>
> > +     c1 = *one;
> > +     if (!c1)
> > +             c1 = '/';
> > +
> > +     c2 = *two;
> > +     if (!c2)
> > +             c2 = '/';
>
> Store off the last character of each, or '/' if we got to the end. Hmm,
> is this right (the guard in 'df_name_compare()' read 'if (!c1 &&
> S_ISDIR(mode1))'). Suppose both strings were "foo", then both c1 and c2
> would be "/", and I think we would return -1.
>
> That doesn't seem quite right to me. I think it *would* be right if we
> checked the mode of each entry before assigning c1 or c2 to '/',
> though. (Being generally unfamiliar with this area, I haven't looked to
> see whether getting access to the modes of each entry at this point is
> easy or not).

Good reasoning; I should have been clearer about one of the
assumptions that this function operates under which precludes that
possibility.

> > +
> > +     if (c1 == c2) {
> > +             /* Getting here means one is a leading directory of the other */

Your example case of both strings being "foo" obviously conflicts with
this comment; but the comment is correct.  This function will never be
given two equal strings because it is called to sort the keys of a
strmap, and strmap keys are unique by construction.  (If one side of
history has "foo" as a directory, and the other side has "foo" as a
path, then there is still only one "foo" in opt->priv->paths.  Every
entry in opt->priv->paths records 3 hashes and modes and whatnot in
order to know what each side of history had at the given path.)

Also, even in that case (when two strings are equal), getting the
right return value would only matter if we cared about a stable sort.
But we call this function with QSORT, not STABLE_QSORT.

Interestingly, this function technically doesn't even need to fully
sort the array either.  For example, if you took the output of 'git
ls-tree -rt HEAD' and permuted the order of files within the same
directory, that kind of level of quasi-sorted would be good enough for
my purposes; I just need files underneath the same directory to be
"together" and the containing directory to appear immediately before
those.  There is a later call to write_tree() that will hande sorting
within a single directory to ensure fully-sorted-ness.  Unfortunately,
I don't know of a way to take advantage of that less strict sorting
requirement for this point of the code to improve the performance
further, so this function just implemented "pretend that every path
has a '/' appended and then fully sort them".

> > +             return (*one) ? 1 : -1;
> > +     }
>
> > +     else
>
> I did find this spacing awkward (would have expected '} else' or even '}
> else {'), but it hardly matters.

I'll fix that up while adding some comments about the purpose of the
function and the fact that it assumes there will be no duplicates in
the array being sorted.