Re: Diff rename detection performance issues

Elijah Newren <newren@xxxxxxxxx> · Mon, 24 Feb 2025 08:30:00 -0800

On Sun, Feb 23, 2025 at 2:30 AM Devste Devste <devstemail@xxxxxxxxx> wrote:
>
> I have a merge commit that includes 2 modified (!) files:

What do you mean that it only includes 2 modified files?  Modified
relative to what?  Modified relative to the merge base of its parents?
Modified relative to its first parent?  To its second parent?
Modified relative to an automatic merge?

Also, by "modified" here, do you mean the change type is 'M' in
--name-status output, or could the change type also be 'A' (added),
'D' (deleted), or something else?

> hello/foo/stubs/example.php
> hello/world.php
>
> I want to only get the changes introduced by the merge commit and
> exclude any changes in /foo/stubs/:
> git diff -l0 --name-status --find-renames "$sha"^'!' -- ':!*/foo/stubs/*'

It's not clear to me from your example what the output of, say,

   git diff --name-status --no-renames "$sha"^'!' | wc -l

would be, though I would find that very interesting.  I'm also curious
what you'd get from each of

  git diff --diff-filter=D --name-status --no-renames "$sha"^'!' | wc -l
  git diff --diff-filter=A --name-status --no-renames "$sha"^'!' | wc -l
  git diff --diff-filter=M --name-status --no-renames "$sha"^'!' | wc -l

(and yes, I am very intentionally leaving off the ':!*/foo/stubs/*'
negative pathspec; I want the output without that.)
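
If it's easier, all of those counts could be gathered in one pass.  A
minimal sketch, assuming a POSIX shell with awk available; tally_status
is a made-up helper name for illustration, not a git command:

```shell
# tally_status: hypothetical helper (not part of git).  Reads
# `git diff --name-status` output on stdin and prints one line per
# status letter with the number of paths that had that status.
tally_status() {
    awk '
        # The first field is the status.  Keep only its first character
        # so that a scored status such as R100 is counted under R.
        { count[substr($1, 1, 1)]++ }
        END { for (s in count) printf "%s %d\n", s, count[s] }
    ' | sort
}

# Intended use (assumes "$sha" names the merge commit in question):
#   git diff --name-status --no-renames "$sha"^'!' | tally_status
```

With --no-renames I'd expect only letters like A, D and M to show up,
and the A and D counts are the ones I'm most interested in.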

> Git takes more than 4 minutes to generate this diff, since
> hello/foo/stubs/example.php is a huge file.

How do you know that is the reason?  Especially since...

> When using --no-renames (instead of --find-renames) it's much, much faster.

...this seems to contradict your statement that the reason for the
slow diff is that hello/foo/stubs/example.php is a huge file.

> And without the example.php file, the diff takes less than 1 second
> instead of 4+ minutes.

What do you mean without the example.php file?  Did you rewind
history, remove that file, and then redo the merge so that it is no
longer included?  Or do you mean something else entirely?  What
exactly?

> Funnily enough, when I have a merge commit that contains only that 1
> excluded file, it's the same behavior.
>
> 1) if there's only a single file in a commit, why does --find-renames
> cause a slowdown? There's nothing that could have been renamed in that
> case (probably the same for --find-copies)

I'm not sure what this has to do with the above; you seem to have
switched tracks.  Suppose you have a commit whose toplevel tree has
exactly 1 file, and you diff it against some other commit with N
files.  If one of those N files has the same name as your 1 file,
then --find-renames can't really cause a slowdown.  It'd only cause a
slowdown when all N files in the other commit have different
filenames than the 1 file in your commit (but are of mostly similar
filesize).  But I suspect you meant something other than what you
said here.  Could you clarify the actual setup?

> 2) could rename detection be "delayed" to only run/check if there are
> actually additions/deletions (and possibly only check those)? If a
> commit only contains modifications (unlike in a really, really 0.0001%
> edge case) but no additions+deletions it's extremely unlikely that
> there's a rename, so detection could be skipped altogether?

Rename detection already does this; in fact, it does better.  Not only
does it exit early when additions + deletions are both empty, it also
exits early when either of the two is empty.

(In fact, there are some other optimizations as well, such as exiting
early if either additions or deletions become empty after removing any
paths involved in exact rename detection, or removing any paths
involved in basename-driven rename matching.)

If you want to see where this is handled, see the "if
(!num_destinations || !num_sources)" check in diffcore-rename.c.
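
To illustrate the idea from the outside, here is a rough shell sketch
over --name-status --no-renames output.  It is not the actual C code;
rename_candidates is a made-up name, and it assumes no copy detection
or -B break detection, so that only deleted paths act as rename
sources and only added paths act as destinations:

```shell
# rename_candidates: hypothetical helper (not part of git).  Reads
# `git diff --name-status --no-renames` output on stdin and reports
# whether rename detection would have anything to pair up.
rename_candidates() {
    awk '
        $1 == "D" { sources++ }        # deleted paths: potential rename sources
        $1 == "A" { destinations++ }   # added paths: potential rename destinations
        END {
            # Same spirit as the early exit: if either side is empty,
            # there is nothing to compare.
            if (!sources || !destinations)
                print "skip: nothing to match"
            else
                printf "compare: %d sources x %d destinations\n", sources, destinations
        }
    '
}

# Intended use (assumes "$sha" names the commit being examined):
#   git diff --name-status --no-renames "$sha"^'!' | rename_candidates
```

A diff consisting only of 'M' entries would report "skip" here, which
is the case you were asking about.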

Now, all that said, I suspect you're getting at something with the
negative pathspecs that is similar to the optimization idea I had for a
real --follow-renames, but before I jump into that, I'd need you to
clarify your setup a fair amount to make sure we're on the same page.