Re: [PATCH v9 6/6] diff-lib: parallelize run_diff_files for submodules

Glen Choo <chooglen@xxxxxxxxxx> · Thu, 16 Mar 2023 18:09:39 -0700

I haven't verified if the code in this version is correct or not, as I
found it a bit difficult to follow through the churn. After reading this
series again, I've established a better mental model of the code, and I
think there are some renames and documentation changes we can make to
make this clearer.

Unfortunately, I think the biggest clarification would be _yet_ another
refactor, and I'm not sure if we actually want to bear so much churn. I
might do this refactor locally to see if it really is _much_ cleaner or
not.

If anyone has thoughts on the refactor, do chime in.

Calvin Wan <calvinwan@xxxxxxxxxx> writes:

> diff --git a/diff-lib.c b/diff-lib.c
> index 744ae98a69..7fe6ced950 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -65,26 +66,41 @@ static int check_removed(const struct index_state *istate, const struct cache_en
>   * Return 1 when changes are detected, 0 otherwise. If the DIRTY_SUBMODULES
>   * option is set, the caller does not only want to know if a submodule is
>   * modified at all but wants to know all the conditions that are met (new
> - * commits, untracked content and/or modified content).
> + * commits, untracked content and/or modified content). If
> + * defer_submodule_status bit is set, dirty_submodule will be left to the
> + * caller to set. defer_submodule_status can also be set to 0 in this
> + * function if there is no need to check if the submodule is modified.
>   */
>  static int match_stat_with_submodule(struct diff_options *diffopt,
>  				     const struct cache_entry *ce,
>  				     struct stat *st, unsigned ce_option,
> -				     unsigned *dirty_submodule)
> +				     unsigned *dirty_submodule, int *defer_submodule_status,
> +				     unsigned *ignore_untracked)
>  {
>  	int changed = ie_match_stat(diffopt->repo->index, ce, st, ce_option);
> +	int defer = 0;
> +
>  	if (S_ISGITLINK(ce->ce_mode)) {
>  		struct diff_flags orig_flags = diffopt->flags;
>  		if (!diffopt->flags.override_submodule_config)
>  			set_diffopt_flags_from_submodule_config(diffopt, ce->name);
> -		if (diffopt->flags.ignore_submodules)
> +		if (diffopt->flags.ignore_submodules) {
>  			changed = 0;
> -		else if (!diffopt->flags.ignore_dirty_submodules &&
> -			 (!changed || diffopt->flags.dirty_submodules))
> -			*dirty_submodule = is_submodule_modified(ce->name,
> -								 diffopt->flags.ignore_untracked_in_submodules);
> +		} else if (!diffopt->flags.ignore_dirty_submodules &&
> +			   (!changed || diffopt->flags.dirty_submodules)) {
> +			if (defer_submodule_status && *defer_submodule_status) {
> +				defer = 1;
> +				*ignore_untracked = diffopt->flags.ignore_untracked_in_submodules;
> +			} else {
> +				*dirty_submodule = is_submodule_modified(ce->name,
> +					 diffopt->flags.ignore_untracked_in_submodules);
> +			}
> +		}
>  		diffopt->flags = orig_flags;
>  	}
> +
> +	if (defer_submodule_status)
> +		*defer_submodule_status = defer;

The crux of this patch is that we are replacing some serial operation
with a parallel operation. The replacement happens here, where we are
replacing is_submodule_modified() by 'deferring' it.

So to verify if the parallel implementation is correct, we should
compare the "setup" and "finish" steps in is_submodule_modified() and
get_submodules_status(). Eyeballing it, it looks correct, especially
because we made sure to refactor out the shared logic in previous
patches.

To reflect this, I think it would be clearer to rename
get_submodules_status() to something similar (e.g.
are_submodules_modified_parallel()), with an explicit comment saying
that it is meant to be a parallel implementation of
is_submodule_modified().

Except, I told a little white lie in the previous paragraph, because
get_submodules_status() isn't _just_ a parallel implementation of
is_submodule_modified()...

> @@ -268,13 +286,52 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>  			}
>  
>  			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
> -							    ce_option, &dirty_submodule);
> +							    ce_option, NULL,
> +							    &defer_submodule_status,
> +							    &ignore_untracked);
>  			newmode = ce_mode_from_stat(ce, st.st_mode);
> +			if (defer_submodule_status) {
> +				struct submodule_status_util tmp = {
> +					.changed = changed,
> +					.dirty_submodule = 0,
> +					.ignore_untracked = ignore_untracked,
> +					.newmode = newmode,
> +					.ce = ce,
> +					.path = ce->name,
> +				};
> +				struct string_list_item *item;
> +
> +				item = string_list_append(&submodules, ce->name);
> +				item->util = xmalloc(sizeof(tmp));
> +				memcpy(item->util, &tmp, sizeof(tmp));
> +				continue;
> +			}

because get_submodules_status() doesn't just contain the results of
the parallel processes, it is _also_ shuttling "changed" and
"ignore_untracked" from match_stat_with_submodule(), as well as
.newmode, .ce and .path from run_diff_files() (basically everything
except .dirty_submodule)...

>  		}
>  
> -		record_file_diff(&revs->diffopt, newmode, dirty_submodule,
> -				 changed, istate, ce);
> +		if (!defer_submodule_status)
> +			record_file_diff(&revs->diffopt, newmode, 0,
> +					   changed,istate, ce);
> +	}
> +	if (submodules.nr) {
> +		unsigned long parallel_jobs;
> +		struct string_list_item *item;
> +
> +		if (git_config_get_ulong("submodule.diffjobs", &parallel_jobs))
> +			parallel_jobs = 1;
> +		else if (!parallel_jobs)
> +			parallel_jobs = online_cpus();
> +
> +		if (get_submodules_status(&submodules, parallel_jobs))
> +			die(_("submodule status failed"));
> +		for_each_string_list_item(item, &submodules) {
> +			struct submodule_status_util *util = item->util;
> +
> +			record_file_diff(&revs->diffopt, util->newmode,
> +					 util->dirty_submodule, util->changed,
> +					 istate, util->ce);
> +		}

so that we can pass all of this back into record_file_diff(). The only
member that is changed by the parallel process is .dirty_submodule,
which is exactly what we would expect from a parallel version of
is_submodule_modified().

If we don't want to do a bigger refactor, I think we should also add
comments to members of "struct submodule_status_util" to document where
they come from and what they are used for.

The rest of the comments are refactor-related.

It would be good if we could avoid mixing unrelated information sources
in "struct submodule_status_util", since a) this makes it very tightly
coupled to run_diff_files() and b) it causes us to repeat ourselves in
the same function (.changed = changed, record_file_diff()).

The only reason why the code looks this way right now is that
match_stat_with_submodule() sets defer_submodule_status based on whether
or not we should ignore the submodule, and this eventually tells
get_submodule_status() what submodules it needs to care about. But,
deciding whether to spawn a subprocess for which submodule is exactly
what the .get_next_task member is for.

> diff --git a/submodule.c b/submodule.c
> index 426074cebb..6f6e150a3f 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -1981,6 +1994,121 @@ unsigned is_submodule_modified(const char *path, int ignore_untracked)
>  	return dirty_submodule;
>  }
>  
> +static struct status_task *
> +get_status_task_from_index(struct submodule_parallel_status *sps,
> +			   struct strbuf *err)
> +{
> +	for (; sps->index_count < sps->submodule_names->nr; sps->index_count++) {
> +		struct submodule_status_util *util = sps->submodule_names->items[sps->index_count].util;
> +		struct status_task *task;
> +
> +		if (!verify_submodule_git_directory(util->path))
> +			continue;

So right here, we could use the "check if this submodule should be
ignored" logic form match_stat_with_submodule() to decide whether or not
to spawn the subprocess. IOW, I am advocating for
get_submodules_status() to be a parallel version of
match_stat_with_submodule() (not a parallel version of
is_submodule_modified() that shuttles extra information).

Another sign that this refactor is a good idea is that it lets us
simplify _existing_ submodule logic in run_diff_files(). Prior to this
patch, we have:

      unsigned dirty_submodule = 0;
      ...
			changed = match_stat_with_submodule(&revs->diffopt, ce, &st,
							    ce_option, NULL,
							    &defer_submodule_status,
							    &ignore_untracked);
      // If submodule was deferred, shuttle a bunch of information
      // If not, call record_file_diff()

but the body of match_stat_with_submodule() is just ie_match_stat() +
some additional submodule logic. Post refactor, this would look
something like:

    struct string_list submodules;
    ...
    // For any submodule, just append it to a list and let the
    // parallel thing take care of it.
    if (S_ISGITLINK(ce->ce_mode) {
      // Probably pass .newmode and .ce to the util too...
      string_list_append(submodules, ce->name);
    } else {
      changed = ie_match_stat(foo, bar, baz);
      record_file_diff();
    }
    ...
    if (submodules.nr) {
      parallel_match_stat_with_submodule_wip_name(&submodules);
      for_each_string_list_item(item, &submodules) {
        record_file_diff(&item);
      }
    }

Which I think is easier to follow, since we won't need
defer_submodule_status any more, and we don't shuttle information from
match_stat_with_submodule(). Though I'm a bit unhappy that it's still
pretty coupled to run_diff_files() (it still has to shuttle .newmode,
.ce). Also, I don't think this refactor lets us avoid the refactors we
did in the previous patches.

> +
> +		task = xmalloc(sizeof(*task));
> +		task->path = util->path;
> +		task->ignore_untracked = util->ignore_untracked;
> +		strbuf_init(&task->out, 0);
> +		sps->index_count++;
> +		return task;
> +	}
> +	return NULL;
> +}
> +
> +static int get_next_submodule_status(struct child_process *cp,
> +				     struct strbuf *err, void *data,
> +				     void **task_cb)
> +{
> +	struct submodule_parallel_status *sps = data;
> +	struct status_task *task = get_status_task_from_index(sps, err);

As an aside, I think we can inline get_status_task_from_index(). I
suspect this pattern was copied from get_next_submodule(), which
gets fetch tasks from two different places (hence _from_index and
_from_changed), but here I don't think we will ever get status tasks
from more than one place.