Re: [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 12 Feb 2021 17:32:59 -0800

"Elijah Newren via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> +MAYBE_UNUSED
> +static int find_basename_matches(struct diff_options *options,
> +				 int minimum_score,
> +				 int num_src)
> +{
> +	int i;
> +	struct strintmap sources;
> +	struct strintmap dests;
> +
> +	/* Create maps of basename -> fullname(s) for sources and dests */
> +	strintmap_init_with_options(&sources, -1, NULL, 0);
> +	strintmap_init_with_options(&dests, -1, NULL, 0);
> +	for (i = 0; i < num_src; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base;
> +
> +		/* exact renames removed in remove_unneeded_paths_from_src() */
> +		assert(!rename_src[i].p->one->rename_used);
> +
> +		/* Record index within rename_src (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&sources, base))
> +			strintmap_set(&sources, base, -1);
> +		else
> +			strintmap_set(&sources, base, i);
> +	}
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		char *filename = rename_dst[i].p->two->path;
> +		const char *base;
> +
> +		if (rename_dst[i].is_rename)
> +			continue; /* involved in exact match already. */
> +
> +		/* Record index within rename_dst (i) if basename is unique */
> +		base = get_basename(filename);
> +		if (strintmap_contains(&dests, base))
> +			strintmap_set(&dests, base, -1);
> +		else
> +			strintmap_set(&dests, base, i);
> +	}
> +
> +	/* TODO: Make use of basenames source and destination basenames */

;-)

So at this point sources and dests can be used to quickly look up,
given a filename, if there is a single src among all sources, and a
single dst among all dests, that have the filename.

I wonder if the second loop over destinations can be "optimized"
further by using the sources map, though.  The reason you quash
entries with -1 when you see second instance of the same name is
because you intend to limit the heuristics only to a uniquely named
file among the removed files going to a uniquely named file among
the added files, right?  So even if a name is unique among dests,
if that name has duplicates on the source side, there is no point
recording its location.  i.e.

	/* record index within dst if it is unique in both dst and src */
	base = get_basename(filename);
	if (strintmap_contains(&sources, base) ||
	    strintmap_contains(&dests, base))
		strintmap_set(&dests, base, -1);
	else
		strintmap_set(&dests, base, i);

perhaps?

I guess it depends on what actually will be written in this "TODO"
space how effective such a change would be.  Presumably, you'd
iterate over &sources while skipping entries that record -1, to
learn (basename, i), and use the basename found there to consult
&dests to see if it yields a non-negative integer j, to notice that
rename_src[i] is a good candidate to match rename_dst[j].  If that
is the case, then such a change won't help as an optimization at
all, as we'd need to consult &dests map with the basename anyway,
so let's scratch the above idea.

In any case, after we walk over rename_src[] and rename_dst[] once,
the number of entries in &sources would be smaller than rename_src[]
so iterating over &sources, hunting for entries that record
non-negative index into rename_src[] would hopefully be cheaper than
the naive loop we've been working with.  I like the idea of using the
strintmap for this part of the code.

Thanks.

> +	strintmap_clear(&sources);
> +	strintmap_clear(&dests);
> +
> +	return 0;
> +}
> +
>  #define NUM_CANDIDATE_PER_DST 4
>  static void record_if_better(struct diff_score m[], struct diff_score *o)
>  {