Re: [PATCH 0/7] Optimization batch 14: trivial directory resolution

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 01 Jul 2021 15:21:14 +0200

On Thu, Jul 01 2021, Elijah Newren via GitGitGadget wrote:

> This series depends textually on ort-perf-batch-12, but is semantically
> independent. (It is both semantically and textually independent of
> ort-perf-batch-13.)

For others following along, that ort-perf-batch-12 is at
https://lore.kernel.org/git/pull.962.v4.git.1623168703.gitgitgadget@xxxxxxxxx/#t
& currently marked as 'will merge to next' in what's cooking.

> Most of my previous series dramatically accelerated cases with lots of
> renames, while providing comparatively minor benefits for cases with few or
> no renames. This series is the opposite; it provides huge benefits when
> there are few or no renames, and comparatively smaller (though still quite
> decent) benefits for cases with many uncached renames.

Sounds good. One thing I haven't seen at a glance is how these
performance numbers compare to the merge-recursive backend. Are we in a
state of reaching parity with it, or pulling ahead?

> [...]
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28), the
> changes in just this series improves the performance as follows:
>
>                      Before Series           After Series
> no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
> mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
> just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms
>
>
> As a reminder, before any merge-ort/diffcore-rename performance work, the
> performance results we started with were:
>
> no-renames-am:      6.940 s ±  0.485 s
> no-renames:        18.912 s ±  0.174 s
> mega-renames:    5964.031 s ± 10.459 s
> just-one-mega:    149.583 s ±  0.751 s

I haven't given any of this a detailed look, just a note/question that
(depending on the answer to the "vs. merge-recursive" question above) we
may want to consider bumping the default for diff.renamelimit at some
point along with any major optimizations.

<random musings follow, the tl;dr is above this line :)>

As an aside, the fact that we have diff.renamelimit is one of the most
"dangerous" landmines/fork-in-eye/shotgun-to-foot edge cases we have in
using diff as plumbing IMO.

E.g. I somewhat recently had to deal with some 3rd party Go-language
lint plugin that can be configured to enforce lints "as of a commit".
I.e. it does a diff from that commit, sees if any introduced "issues"
are "new", and complains accordingly. The idea is that it allows you to
enforce lints on "only new code", say ignoring the return value of
os.Write(), without insisting that all existing code must be
whitelisted/fixed first.

The problem is two-fold. One is that the thing will get slower over time
as we grow history (can't be avoided). The more subtle one is that at
some point we'll bump into the diff.renamelimit, and whatever unlucky
sob does so will find that the lint is now complaining about ALL THE
THINGS, since "old" code is now ending up as "new" to a naïve diff
parser relying on not bumping into the diff.renamelimit.
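
To make that failure mode concrete, here is a rough Python sketch of
what such a consumer might do. It is not the plugin's actual code: the
function names are made up for illustration, and it assumes the tool
reads "git diff --name-status -M" output. The "-l0" is my assumption
about how a careful consumer could opt out of the limit, going by the
git-diff documentation saying that a limit of 0 is treated as
unlimited.

```python
from typing import List


def diff_command(base: str) -> List[str]:
    """Build a diff invocation with an explicit rename limit.

    Hypothetical helper: -l0 is assumed to mean "no rename limit", so the
    result does not depend on the user's diff.renamelimit setting.
    """
    return ["git", "diff", "--name-status", "-M", "-l0", base]


def paths_seen_as_new(name_status: str) -> List[str]:
    """Return the paths a naive consumer would treat as entirely new code.

    Each --name-status line is a status field, a tab, and one or two paths.
    Only "A" (added) entries count as new here; "R" (renamed) entries are
    treated as existing code that merely moved.
    """
    new_paths = []
    for line in name_status.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        if fields[0].startswith("A"):
            new_paths.append(fields[1])
    return new_paths


# Rename detection ran: the moved file is reported as a rename.
detected = "R100\told.go\tnew.go\nM\tmain.go\n"
# Rename detection was skipped: the same move shows up as delete + add.
skipped = "D\told.go\nA\tnew.go\nM\tmain.go\n"

print(paths_seen_as_new(detected))
print(paths_seen_as_new(skipped))
```

With the second input the whole of new.go counts as "new", which is
exactly the point where the lint starts complaining about old code.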

Arguably bumping the diff.renamelimit would make that sort of problem
worse for plumbing consumers, since they'd have more rope with which to
hang themselves; maybe it's better to step on that landmine early.

Sorry about the somewhat pointless but perhaps amusing digression in
the last 4 paragraphs :)

P.S.: I ended up dealing with the Go plugin by not using the "diff"
      feature, but just a one-off giant whitelist of all existing
      instances of stuff it would complain about.