Re: [PATCH v2 2/4] doc: clarify documentation for rename/copy limits

Elijah Newren <newren@xxxxxxxxx> · Wed, 14 Jul 2021 09:30:28 -0700

On Wed, Jul 14, 2021 at 12:47 AM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
> On Wed, Jul 14 2021, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@xxxxxxxxx>
> >
...
> > diff --git a/Documentation/config/diff.txt b/Documentation/config/diff.txt
> > index 2d3331f55c2..e26a63d0d42 100644
> > --- a/Documentation/config/diff.txt
> > +++ b/Documentation/config/diff.txt
> > @@ -118,9 +118,10 @@ diff.orderFile::
> >       relative to the top of the working tree.
> >
> >  diff.renameLimit::
> > -     The number of files to consider when performing the copy/rename
> > -     detection; equivalent to the 'git diff' option `-l`. This setting
> > -     has no effect if rename detection is turned off.
> > +     The number of files to consider in the exhaustive portion of
> > +     copy/rename detection; equivalent to the 'git diff' option
> > +     `-l`.  If not set, the default value is 400.  This setting has
> > +     no effect if rename detection is turned off.
>
> For both this...
>
> >  diff.renames::
> >       Whether and how Git detects renames.  If set to "false",
> > diff --git a/Documentation/config/merge.txt b/Documentation/config/merge.txt
> > index 6b66c83eabe..aca0c92dbe6 100644
> > --- a/Documentation/config/merge.txt
> > +++ b/Documentation/config/merge.txt
> > @@ -33,10 +33,12 @@ merge.verifySignatures::
> >  include::fmt-merge-msg.txt[]
> >
> >  merge.renameLimit::
> > -     The number of files to consider when performing rename detection
> > -     during a merge; if not specified, defaults to the value of
> > -     diff.renameLimit. This setting has no effect if rename detection
> > -     is turned off.
> > +     The number of files to consider in the exhaustive portion of
> > +     rename detection during a merge.  If not specified, defaults
> > +     to the value of diff.renameLimit.  If neither
> > +     merge.renameLimit nor diff.renameLimit are specified, defaults
> > +     to 1000.  This setting has no effect if rename detection is
> > +     turned off.
>
> ...and this we should really have some wording to the effect of:
>
>     ..., defaults to XYZ. The exact (or even approximate) default of XYZ
>     should not be relied upon, and may be changed (or these limits even
>     removed) in future versions of git.
>
> I.e. let's distinguish a "this is how it works now, FYI" from a forward
> promise that it's going to work like that forever.

Perhaps I could simplify that to "...currently defaults to XYZ"?  Does
that capture your intent well enough?

I agree very much that this shouldn't be a hard promise.  If it was,
we've broken it repeatedly over the years.  Not only have we modified
the default numbers multiple times, there have also been multiple
times when we've been able to change how many renames could be handled
without changing the default numbers for the rename limits.  (This was
done by adding or improving special cases that allow us to exclude
paths from the exhaustive comparison; exact rename detection that
someone else added and my more recent improvements to exact rename
detection are two simple examples).

> Which also leads me to:
>
> >  merge.renames::
> >       Whether Git detects renames.  If set to "false", rename detection
> > diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> > index 32e6dee5ac3..11e08c3fd36 100644
> > --- a/Documentation/diff-options.txt
> > +++ b/Documentation/diff-options.txt
> > @@ -588,11 +588,12 @@ When used together with `-B`, omit also the preimage in the deletion part
> >  of a delete/create pair.
> >
> >  -l<num>::
> > -     The `-M` and `-C` options require O(n^2) processing time where n
> > -     is the number of potential rename/copy targets.  This
> > -     option prevents rename/copy detection from running if
> > +     The `-M` and `-C` options have an exhaustive portion that
> > +     requires O(n^2) processing time where n is the number of
> > +     potential rename/copy targets.  This option prevents the
> > +     exhaustive portion of rename/copy detection from running if
> >       the number of rename/copy targets exceeds the specified
> > -     number.
> > +     number.  Defaults to diff.renameLimit.
>
> This mention of O(n^2). This is not an issue with your patch/series, nd
> this is an improvement.
>
> But as a side-comment: Do we explain somewhere how exactly this
> {diff,merge}.renmeLimit=N works for values of N? It's probably fine to
> continue to handwave it away, but is there anything that say tells a
> user what happens if they have 400 files in their repository in a "x"
> directory and "git mv x y"? Will it work, but if they add a 401th file
> it won't for diffs?
>
> I *think* given a skimming of previous discussions that it's "no",
> i.e. that this just kicks in for these expensive cases, and e.g. for
> such a case of moving the same tree OID from "x" to "y" we'd
> short-circuit things, but maybe I'm just making things up at this point.

Tree OIDs aren't used by the rename detection logic, but you're close.

> Feel free to ignore me here, I guess this amounts to a request that it
> would be great if we could point to some docs about how this works. It's
> probably too much detail for Documentation/config/{merge,diff}.txt, so
> maybe a section in either of git-{diff,merge}.txt ?

If git-merge.txt includes it, then rebase, cherry-pick, revert
probably should too.  (and maybe the -3 option of git-am)

Anyway...

In the "git mv dir-x dir-y" case, if no other changes are made to
those files, then the renames would be detectable via blobs having the
same hashes (i.e. exact renames).  So all these renames would be found
regardless of the number of files and without exhaustive detection
even kicking in.

If there were thousands of such files, and every one of them were also
modified, then the basename-guided rename detection which operates in
linear time would still find all the renames, without the exhaustive
detection even kicking in (see commits bd24aa2f97a0 (diffcore-rename:
guide inexact rename detection based on basenames, 2021-02-14) and
37a251436447 (diffcore-rename: use directory rename guided basename
comparisons, 2021-02-27)).

If this were a merge, and there were thousands of such files renamed
and modified, AND they were all further renamed to change their
basename, BUT the original directory was unmodified on the other side
of history, then we wouldn't need any of the renames for the purposes
of the merge and they'd all be excluded from the exhaustive rename
detection.  Sure, we wouldn't find renames for those files, but that
wouldn't change the result of the merge.  So, again, no exhaustive
rename detection.  (See commit 174791f0fb23 (merge-ort: use
relevant_sources to filter possible rename sources, 2021-03-11))

(Rebases & cherry-picks also have an advantage of caching renames from
previous picks on the upstream side of history to avoid needing to
re-detect renames on that side...though it only caches renames that
are either relevant or exact.)

Basically, we bail on exhaustive rename detection if, after the linear
or faster steps have run (excluding previously cached renames, exact
rename detection, skipping irrelevant renames, basename guided rename
detection), the number of remaining unpaired source files times the
number of remaining unpaired destination files is greater than
rename_limit * rename_limit.  (If we can assume the numbers match,
i.e. N = number of remaining unpaired sources = number of remaining
unpaired destinations, then that check simplifies to bailing once N >
rename_limit)

I'm not sure if these details are really going to help the users,
especially if more special cases are added (and more have been
proposed by Johannes and became an Outreachy project, though that one
didn't get off the ground).  And this explanation is not even fully
detailed; I'm still hand waving here in at least four different ways
(mostly in ignoring special cases for different fast steps, but also
by ignoring copy detection that itself messes with the explanation in
multiple ways).

But maybe someone else can figure out a way to hand wave my hand
waving down to a simple enough explanation, while also covering copy
detection, and managing to do so in a way that doesn't mislead.  If
so, we can include that in the docs somewhere.