Re: Git Rename Detection Bug

Elijah Newren <newren@xxxxxxxxx> · Sat, 23 Dec 2023 23:46:01 -0800

Hi Philip,

Sorry for the late reply; I somehow missed this earlier.

On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote:
>
> Hi Elijah,
>
> On 11/11/2023 05:46, Elijah Newren wrote:
> > * filename similarity is extraordinarily expensive compared to exact
> > renames, and if not carefully handled, can sometimes rival the cost of
> > file content similarity computations given our spanhash
> > representations.
>
> I've not heard of spanhash representation before. Any references or
> further reading?

You can find more in diffcore-delta.c, especially the big comment near
the top of the file.  But here's a short explanation of spanhashes:
  * Split files into chunks delimited either by LF or 64 bytes,
whichever comes first.
  * Hash every chunk into an integer between 0 and 107926
  * Keep a character count for each of those integers as well (thus if
a line has N characters, but appears twice in the file, the associated
count for that integer will be 2N).
  * A "spanhash" is the combination of the integer that a chunk (or
span) hashes to, plus the count associated with it.
  * The list/array of spanhashes for a file (i.e. the list/array of
integers and character counts) is used to compare one file to another.

Now, why do I claim that comparison of filenames can rival cost of
file content similarity?  Well, in a monorepo of interest, the median
sized file is named something close to
"modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png"
and is 2875 bytes.  As a png, all its chunks are probably the full 64
characters, which works out to about 45 chunks (assuming the 64-byte
chunks are different from each other).  The filename is 79 characters.
So, for this case, 45 pairs of integers vs 79 characters.  So, the
comparison cost is roughly the same order of magnitude.
(Yes, creating the spanhashes is a heavy overhead; however, we only
initialize it once and then do N comparisons of each spanhash to the
other spanhashes.  And we'd be doing N comparisons of each filename to
other filenames, so the overhead of creating the spanhashes can be
overlooked if your merge has enough files modified on both sides of
history.)

Yes, this particular repository is a case I randomly picked that you
can argue is special.  But rather than look at the particular example,
I think it's interesting to check how the spanhash size vs. filename
size scale with repository size.  From my experience: (1) I don't
think the median-sized file varies all that much between small and big
repositories; every time I check a repo the median size seems to be
order of a few thousand bytes, regardless of whether the repository
I'm looking at is tiny or huge, (2) while small repositories often
have much shorter filenames, big repositories often will have
filenames even longer than my example; length of filename tends to
grow with repository size from deep directory nestings.  So, between
these two facts, I'd expect the filename comparison costs to grow
relative to file content comparison costs, when considering only
median-sized files being modified.  And since it's common to have
merges or rebases or diffs where only approximately-median-sized files
are involved, I think this is relevant to look at.  Finally, since I
already had an example that showed the cost likely roughly comparable
for a random repository of interest, and it's not even all that big a
repository compared to many out there, I think the combination
motivates pretty well my claim that filename similarity costs _could_
rival file content similarity costs if one wasn't careful.

I don't have a rigorous proof here.  And, in fact, I ended up doing
this rough back-of-the-envelope analysis _after_ implementing some
filename similarity comparison ideas and seeing performance degrade
badly, and wondering why it made such a difference.  I don't know if I
ever got exact numbers, but I certainly didn't record them.  This
rough analysis, though, was what made me realize that I needed to be
careful with any such added filename comparisons, though, and is why
I'm leery of adding more.