Re: [PATCH 4/9] submodule.c: sort changed_submodule_names before searching it

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 12 Sep 2018 11:18:02 -0700

Stefan Beller <sbeller@xxxxxxxxxx> writes:

> We can string_list_insert() to maintain sorted-ness of the
> list as we find new items, or we can string_list_append() to
> build an unsorted list and sort it at the end just once.
>
> To pick which one is more appropriate, we notice the fact
> that we discover new items more or less in the already
> sorted order.  That makes "append then sort" more
> appropriate.

Sorry, but I still do not get the math you are implying in the
second paragraph.  Are you saying that append-then-sort is efficient
when items being appended is already sorted?  That depends on the
sorting algorithm used, so the logic is incomplete unless you say
"given that we use X for sorting,...", I think.

Do we really discover new items in sorted order, by the way?  In a
single diff invocation made inside collect_changed_submodules() for
one commit in the superproject's history, we will grab changed paths
in the pathname order (i.e. sorted); if the superproject's tip commit
touches the submodules at paths A and Z, we will discover these two
paths in sorted order.

But because we are walking the superproject's history to collect all
paths that have been affected in that function, and repeatedly
calling diff as we discover commit in the superproject's history, I
am not sure how well the resulting set of paths would be sorted.

The tip commit in superproject's history may have modified the
submodule at path X, the parent of that commit may have touched the
submodule at path M, and its parent may have touched the submodule
at path A.  Don't we end up grabbing these paths in that discoverd
order, i.e. X, M and A?

I still think changing it from "insert as we find an item, keeping
the list sorted" to "append all and then sort before we start
looking things up from the result" makes sense, but I do not think
the "we find things in sorted order" is either true, or it would
affect the choice between the two.  A justification to choose the
latter I can think of that makes sense is that we don't have to pay
cost to keep the list sorted while building it because we do not do
any look-up while building the list.

> Signed-off-by: Stefan Beller <sbeller@xxxxxxxxxx>
> ---
>  submodule.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/submodule.c b/submodule.c
> index d29dfa3d1f5..c6eff7699f3 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1256,7 +1256,7 @@ static int get_next_submodule(struct child_process *cp,
>  		case RECURSE_SUBMODULES_DEFAULT:
>  		case RECURSE_SUBMODULES_ON_DEMAND:
>  			if (!submodule ||
> -			    !unsorted_string_list_lookup(
> +			    !string_list_lookup(
>  					&changed_submodule_names,
>  					submodule->name))
>  				continue;
> @@ -1350,6 +1350,7 @@ int fetch_populated_submodules(struct repository *r,
>  	/* default value, "--submodule-prefix" and its value are added later */
>  
>  	calculate_changed_submodule_paths();
> +	string_list_sort(&changed_submodule_names);
>  	run_processes_parallel(max_parallel_jobs,
>  			       get_next_submodule,
>  			       fetch_start_failure,