Re: [PATCH v8 6/6] diff-lib: parallelize run_diff_files for submodules

Glen Choo <chooglen@xxxxxxxxxx> · Mon, 13 Feb 2023 16:36:18 +0800

Calvin Wan <calvinwan@xxxxxxxxxx> writes:

> @@ -244,6 +266,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			newmode = ce->ce_mode;
>  		} else {
>  			struct stat st;
> +			unsigned ignore_untracked = 0;
> +			int defer_submodule_status = 1;
>  
>  			changed = check_removed(istate, ce, &st);
>  			if (changed) {

Previously [1] it wasn't entirely clear whether we intended to always
parallelize submodule diffing, but now it seems that we always try to
parallelize. In essence, this means that we don't have a serial
implementation any more, but maybe that's okay.

[1] https://lore.kernel.org/git/kl6lilgtveoe.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

> @@ -265,14 +289,53 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> -							    ce_option, &dirty_submodule);
> +							    ce_option, &dirty_submodule,
> +							    &defer_submodule_status,
> +							    &ignore_untracked);

Here we get the 'changed' bit of the submodule. Because we always defer,
we never call is_submodule_modified() inside
match_stat_with_submodule() and as such, we never set "dirty_submodule"
here. If so, could we remove the variable altogether?

>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));
> +				continue;
> +			}
>  		}
>  
>  		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
>  				       changed, istate, ce))

I'm surprised to see that we still call "diff_change_helper()" even
though we've 'deferred' the submodule diff, especially since "changed"
is set and "dirty_submodule" is unset. Even if this is safe, I think we
shouldn't do this because...

> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			if (diff_change_helper(&revs->diffopt, util->newmode,
> +				       util->dirty_submodule, util->changed,
> +				       istate, util->ce))

Here we call "diff_change_helper()" again on the deferred submodule, but
now with the "dirty_submodule" value we expected. At best this is
wasteful, but at worst this is possibly wrong.

For good measure, I applied this patch to see if we needed either
"dirty_submodule" or the second "diff_change_helper()" call; our
test suite still passes after I remove both of them.

  diff --git a/diff-lib.c b/diff-lib.c
  index 2dde575ec6..21adcc7fd6 100644
  --- a/diff-lib.c
  +++ b/diff-lib.c
  @@ -156,6 +156,7 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
      struct cache_entry *ce = istate->cache[i];
      int changed;
      unsigned dirty_submodule = 0;
  +		int defer_submodule_status = 1;

      if (diff_can_quit_early(&revs->diffopt))
        break;
  @@ -267,7 +268,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
      } else {
        struct stat st;
        unsigned ignore_untracked = 0;
  -			int defer_submodule_status = 1;

        changed = check_removed(istate, ce, &st);
        if (changed) {
  @@ -311,9 +311,9 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
        }
      }

  -		if (diff_change_helper(&revs->diffopt, newmode, dirty_submodule,
  -				       changed, istate, ce))
  -			continue;
  +		if (!defer_submodule_status)
  +			diff_change_helper(&revs->diffopt, newmode, 0,
  +					   changed, istate, ce);
    }
    if (submodules.nr) {
      unsigned long parallel_jobs;

> +static void parse_status_porcelain_strbuf(struct strbuf *buf,
> +				   unsigned *dirty_submodule,
> +				   int ignore_untracked)
> +{
> +	struct string_list list = STRING_LIST_INIT_DUP;
> +	struct string_list_item *item;
> +
> +	string_list_split(&list, buf->buf, '\n', -1);
> +
> +	for_each_string_list_item(item, &list) {
> +		if (parse_status_porcelain(item->string,
> +					   strlen(item->string),
> +					   dirty_submodule,
> +					   ignore_untracked))
> +			break;
> +	}
> +	string_list_clear(&list, 0);
> +}

Given that this function only has one caller, is quite simple, and isn't
actually a strbuf version of "parse_status_porcelain()" (it's actually a
multiline version that also happens to accept a strbuf), I think this
might be better inlined.

> +test_expect_success 'status in superproject with submodules (parallel)' '
> +	git -C super status --porcelain >output &&
> +	git -C super -c submodule.diffJobs=8 status --porcelain >output_parallel &&
> +	diff output output_parallel
> +'
> +
>  test_done

When I first suggested this test, I thought that we would sometimes
defer submodule status and sometimes not, so this would be a good way to
check between both modes. Now this is less useful, since this is only
checking that parallelism > 1 doesn't affect the output, but it's still
a useful reasonableness check IMO. Thanks.