Re: [PATCH 06/11] submodule.c: sort changed_submodule_names before searching it

Stefan Beller <sbeller@xxxxxxxxxx> · Tue, 11 Sep 2018 11:31:55 -0700

On Thu, Sep 6, 2018 at 11:03 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Stefan Beller <sbeller@xxxxxxxxxx> writes:
>
> > Instead of sorting it after we created an unsorted list, we could insert
> > correctly into the list.
>
> It is unclear what problem you are solving, especially with
> subjunctive "could" there.  We are creating an unsorted list and
> then sorting it and you see it as a problem because it is just as
> easy and efficient to do the insertion sort while building up the
> list?  (don't react and answer without reading all the way to the
> end; I think I know what is going on).
>
> > As the unsorted append is in order of cache entry
> > names, this is already sorted if names were equal to paths for submodules.
>
> That may be a statement of a fact, but it is unclear how that fact
> relates to what the code is doing before this patch, or what the
> code updated by this patch is doing.
>
> > As submodule names are often the same as their path, the input is sorted
> > pretty well already, so let's just do the sort afterwards.
>
> It is unclear what (performance?) trade-off this senttence is trying
> to make.  It sounds as if it is claiming this:
>
>         We can string_list_insert() to maintain sorted-ness of the
>         list as we find new items, or we can string_list_append() to
>         build an unsorted list and sort it at the end just once.
>
>         To pick which one is more appropriate, we notice the fact
>         that we discover new items more or less in the already
>         sorted order.  That makes "append then sort" more
>         appropriate.
>
> But is that reasoning sensible?
>
> I'd imagine that append-then-sort would always be more efficient
> than insert-at-the-right-place-as-we-go, and the reason why we
> sometimes would need to do the latter is when we need to look up
> elements in the middle of building the list (e.g. we may want to
> de-dup, which requires us to look up a name from the ones we
> collected so far).

If we come across a (mostly) sorted list, then the insert-at-the-right-place
we'd come across the best case of insertion sort which is O(n), which
sounds better than append-then-sort as our sorting is a merge sort,
which has O(n log n) even in its best case (and needs to copy stuff
into a temp buffer and back).

By having the submodules named after its path, I strongly suspect
we have a mostly sorted list in nearly all cases except some really
interesting corner cases out there.

> And in this application, calculate_changed_submodule_paths()
> discover paths by calling collect_changed_submodules() which finds a
> mapping <submodule name, oid of commits> into a list sorted by
> submodule name, and then goes through that list and builds a list of
> submodules paths (which could be different from submodule names) by
> appending.  Only after this list is fully built, get_next_submodule()
> gets called, so making the latter use string_list_lookup() that assumes
> a sorted list is safe if we built the list by append-then-sort (iow,
> sortedness while building the list does not matter).
>
> Having analysed all that, I find it somewhat iffy that _append() is
> used there in the calculate_changed_submodule_paths() function.

Note that this is fixed in the later patch
"submodule.c: do not copy around submodule list"

>  It
> would cause the resulting changed_submodule_names list to hold the
> same name twice (or more),

This would be possible if there is a submodule at path A and another
submodule (at a different path) named "A", as we'll try hard to collect
names, but are also okay with path as we want to keep supporting the
historical use case of submodules.

> but I do not know if that would pose a
> problem to the consumer of the list.  Using "accumulate then sort
> before calling look-up" would not change it as string_list_sort()
> would not dedup, so I do not think this patch would introduce a new
> problem, though.

Yes, that is true, so we'd want to extend the message above to
mention the potential duplicates.

Thanks,
Stefan