Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost

On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@xxxxxxxxx>
...
> Often, I noticed that when the optimization does not apply, it is
> because there are a handful of relevant sources -- maybe even only one.
> It felt frustrating to need to recurse into potentially hundreds or even
> thousands of directories just for a single rename, but it was needed for
> correctness.
> 
> However, staring at this list of functions and noticing that
> process_entries() is the most expensive and knowing I could avoid it if
> I had cached renames suggested a simple idea: change
>    collect_merge_info()
>    detect_and_process_renames()
>    process_entries()
> into
>    collect_merge_info()
>    detect_and_process_renames()
>    <cache all the renames, and restart>
>    collect_merge_info()
>    detect_and_process_renames()
>    process_entries()
> 
> This may seem odd and look like more work.  However, note that although
> we run collect_merge_info() twice, the second time we get to employ
> trivial directory resolves, which makes it much faster, so the increased
> time in collect_merge_info() is small.  While we run
> detect_and_process_renames() again, all renames are cached so it's
> nearly a no-op (we don't call into diffcore_rename_extended() but we do
> have a little bit of data structure checking and fixing up).  And the
> big payoff comes from the fact that process_entries() will be much
> faster due to having far fewer entries to process.

I enjoyed the story you tell here.
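
For anyone else following along, here's how I pictured the resulting
control flow (my own shorthand paraphrasing the commit message, with
argument lists elided -- not the literal code from the patch):

	collect_merge_info(...);
	detect_and_process_renames(...);
	if (renames->redo_after_renames) {
		/* Renames are now cached, so the second pass gets
		 * trivial directory resolves and near-free rename
		 * detection. */
		collect_merge_info(...);
		detect_and_process_renames(...);
	}
	process_entries(...);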

> +	if (path_count_after) {
> +		/*
> +		 * Not sure where the right cut-off is for the optimization
> +		 * to redo collect_merge_info() after we've cached the
> +		 * regular renames.  Basically, collect_merge_info(),
> +		 * detect_regular_renames(), and process_entries() are
> +		 * similar costs and all big tent poles.  Caching the
> +		 * result of detect_regular_renames() means that redoing
> +		 * that one function will cost us virtually 0 extra, so it
> +		 * depends on the other two functions, which are both O(N)
> +		 * cost in the number of paths.  Thus, it makes sense that
> +		 * if we can cut the number of paths in half, then redoing
> +		 * collect_merge_info() at half cost in order to get
> +		 * process_entries() at half cost should be about equal
> +		 * cost.  If we can cut by more than half, then we would
> +		 * win.  The fact that process_entries() is about 10%-20%
> +		 * more expensive than collect_merge_info() suggests we
> +		 * could make the factor less than two.  The fact that,
> +		 * even when we have renames cached, we still have to
> +		 * traverse down to the individual (relevant) renames
> +		 * suggests we should perhaps use a bigger factor.
> +		 *
> +		 * The exact number isn't critical, since the code will
> +		 * work even if we get the factor wrong -- it just might be
> +		 * slightly slower if we're a bit off.  For now, just err
> +		 * on the side of a bigger fudge.  For the linux kernel

super-nit: s/linux/Linux/

> +		 * testcases I was looking at with massive renames, the
> +		 * ratio came in around 50 to 250, which clearly triggered
> +		 * this optimization and provided some *very* nice
> +		 * speedups.

This bit of your testing might be more appropriate for your commit
message. This discussion of a test made at a certain point in time
is more likely to go stale than the description of how this factor
does not change correctness, only performance.

> +		 */
> +		int wanted_factor = 3;

and perhaps make it 'const'?

> +
> +		/* We should only redo collect_merge_info one time */
> +		assert(renames->redo_after_renames == 0);
> +
> +		if (path_count_after / path_count_before > wanted_factor) {

With truncation from integer division, this condition is equivalent* to

	path_count_after >= 4 * path_count_before

Or, do you want to change this to a ">=" so that the factor of 3 seems
more accurate?

*I understand the intention of using division to avoid (unlikely)
overflow via multiplication. The truncation is causing some confusion.
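
To make the truncation concrete (an illustrative snippet, not a
proposed patch), with integer division and positive counts:

	/* As written, this condition... */
	if (path_count_after / path_count_before > wanted_factor)

	/* ...is the same as (overflow concerns aside): */
	if (path_count_after >= (wanted_factor + 1) * path_count_before)

	/* whereas "grew by at least a factor of 3" would read: */
	if (path_count_after / path_count_before >= wanted_factor)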
  
> -test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
> +test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
>  	test_setup_12f &&
>  	(
>  		cd 12f &&
> 

Oh, and a bonus test success! Excellent.

Thanks,
-Stolee


