Re: [PATCH 1/5] merge-ort: replace string_list_df_name_compare with faster alternative

Elijah Newren <newren@xxxxxxxxx> · Fri, 28 May 2021 11:09:25 -0700

On Fri, May 28, 2021 at 9:12 AM René Scharfe <l.s.r@xxxxxx> wrote:
>
> Am 28.05.21 um 00:47 schrieb Elijah Newren:
> > On Thu, May 27, 2021 at 2:00 PM René Scharfe <l.s.r@xxxxxx> wrote:
> >>
> >> Am 27.05.21 um 10:37 schrieb Elijah Newren via GitGitGadget:
...
> >>>  /*** Function Grouping: functions related to process_entries() ***/
> >>>
> >>> -static int string_list_df_name_compare(const char *one, const char *two)
> >>> +static int sort_dirs_next_to_their_children(const void *a, const void *b)
> >>>  {
> >>> -     int onelen = strlen(one);
> >>> -     int twolen = strlen(two);
> >>
> >> The old code scans both strings fully, while the new one stops when it
> >> reaches a difference and doesn't look at any further characters.  How
> >> much does that contribute to the speedup?  (I suspect a lot.)
> >
> > Oh, indeed, good catch.  It appears to be responsible for essentially all of it.
>
> Then you can keep the original function signature (as well as the use of
> string_list_sort) and avoid explicit casts.

Fair enough; though I'll probably still apply the function rename,
even if I keep the old signature.  The function has a completely
different purpose in merge-ort.c than it does in merge-recursive.c.

> The same function exists in merge-recursive.c, by the way.  I suspect we
> could avoid sorting entirely there by taking advantage of the index
> order and a mechanism like the one in the second half of
> fsck.c::verify_ordered().  That's a bit tricky, though (for me anyway).
>
> All tests still pass when I replace string_list_df_name_compare() with
> strcmp() in merge-recursive.c, so the first thing needed would be tests
> that highlight the difference between those comparison functions,
> however.  Not sure if it's worth it -- merge-recursive is on its way
> out, right?

Interesting idea (and interesting that the function might not matter
anymore in merge-recursive.c and might be replaceable by strcmp), but
yeah merge-recursive.c is on its way out.  In fact, there are two
other reasons to also not modify merge-recursive.c along these lines:
(1) We want to keep merge-recursive stable right now as we push people
to try out merge-ort, so that we have a good comparison point if they
run into problems; if we modify merge-recursive, especially in tricky
ways, then our comparison point might get invalidated.  (2) This
codepath is relatively hot in merge-ort.c because it has optimized all
the other stuff, so saving approximately half a second is significant.
In merge-recursive.c, saving half a second is a small blip in the
overall timing that would be lost in the run-to-run variability.
(Saving 0.7s off of 10.127 s ±  0.073 s is huge; but removing that
much time from 5964.031 s ± 10.459 is wasted effort.)

> Not sure if d/f conflicts could also be detected in merge-ort.c without
> sorting -- the original order is lost when the paths are thrown into
> a strmap.

While string_list_df_name_compare() in merge-recursive.c is used for
the purpose of detecting D/F conflicts, this function in merge-ort.c
had nothing to do with detecting D/F conflicts.  In fact, D/F
conflicts were already detected in collect_merge_info_callback() on
this line:

    unsigned df_conflict = (filemask != 0) && (dirmask != 0);

and recorded in setup_path_info(), which was done long before it got
to this point in the code.  The point of this sorting in merge-ort.c
is just to put paths in an order that allows us to write out trees as
we process them and record the resulting hash for the next directory
up the chain as we go.