Re: Pre-computed similarity indexes

Elijah Newren <newren@xxxxxxxxx> · Tue, 26 Jul 2022 00:27:27 -0700

On Sun, Jul 24, 2022 at 7:29 AM Philip <philip.c.peterson@xxxxxxxxx> wrote:
>
> Hello all,
>
> I noticed that Git LFS-tracked files cannot correctly detect renames,

"correctly"?  I was expecting you to say that it just wouldn't detect
renames at all for LFS files.  Does the wording of your question imply
that Git is detecting LFS files as renames that aren't actually
renames?

You got me curious.  It appears the LFS pointer files are three lines
long (unless I'm misreading the spec; I have virtually no experience
with LFS).  The first line appears to always be the same for practical
purposes, the second line holds the hash, and the third line holds the
real file length.  What counts as far as rename detection isn't the
number of matching lines, though, but the number of matching
characters from matching lines.  So if one pointer file had e.g.

    version https://git-lfs.github.com/spec/v1
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
    size 12345

and you have another that looks like

    version https://git-lfs.github.com/spec/v1
    oid sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa24d17e2393
    size 12345

then, since git splits lines into 64-byte chunks, our "matching lines" are:
    version https://git-lfs.github.com/spec/v1
    a24d17e2393
    size 12345
(for a total of 66 bytes, including newline characters), and the
"unmatching lines" are
    oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2c
    oid sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
(neither of which has a trailing newline; each is exactly 64 bytes,
i.e. the 11-byte "oid sha256:" prefix plus the first 53 hex digits,
which leaves the last 11 hex digits for the next chunk)

The longer of these two pointer files is 130 bytes (actually, both
files are 130 bytes).  Since the number of matching bytes from
matching lines is 66, we check that 66/130.0 > 0.5, so Git will
consider this a rename using the default similarity threshold of 50%.

So, for LFS files to be detected as a rename, the real file size has
to be at least 100 bytes (so that the real file size has at least
three digits), the real files have to be the same size (e.g. in our
example they were both 12345 bytes), and the last 11 hex digits of
their sha256 must match.
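
The byte counting in the example above can be reproduced with a rough
Python sketch.  This is only a model of the chunk-and-count idea, not
Git's code: it compares chunks byte-for-byte instead of by hash, and
the helper names (pointer, chunks, similarity) are made up for
illustration.

```python
from collections import Counter

def pointer(oid, size):
    """Build the three-line LFS pointer text used in the example."""
    return ("version https://git-lfs.github.com/spec/v1\n"
            f"oid sha256:{oid}\n"
            f"size {size}\n").encode()

def chunks(data, limit=64):
    """Split at each newline or after `limit` bytes, whichever comes first."""
    out, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == limit:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        # Simplification: keep a trailing partial chunk as-is.
        out.append(data[start:])
    return out

def similarity(a, b):
    """Return (matching bytes, matching bytes / size of the larger file)."""
    common = Counter(chunks(a)) & Counter(chunks(b))
    matched = sum(len(chunk) * count for chunk, count in common.items())
    return matched, matched / max(len(a), len(b))

old = pointer("4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393", 12345)
new = pointer("a" * 54 + "24d17e2393", 12345)
matched, ratio = similarity(old, new)
print(matched, len(old), round(ratio, 3))
```

On the two example pointer files this reports 66 matching bytes out of
130, which is just over the 50% threshold.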

The first condition is probably trivial to satisfy.  The second is a
little unlikely, but it's pretty easy to imagine random matches.
However, you'd have to have an awful lot of LFS files in a repository
before there's any realistic chance of meeting the third condition.
Somewhere in the low millions of them (based on
birthday paradox approximations) -- all with the same underlying file
size on the third line.  (Or, if you had lots of different underlying
file sizes, then dramatically more LFS files would be required to get
a collision.)  Given that each file is large, that'd have to be a
ginormous repository if fully checked out.  So, is this actually an
issue in practice?  (Perhaps there's a risk of a dedicated disgruntled
employee generating trailing hash collisions just to make git log show
misleading rename output for LFS files...but if so, that seems pretty
tame for the amount of effort put in?)
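
As a sanity check on that estimate, the usual birthday approximation
n ~ sqrt(2 * N * ln(1/(1-p))) can be evaluated directly.  A sketch,
assuming the trailing hex digits are uniformly random and that every
file has the same size line; the function name is hypothetical, and
the tail length is left as a parameter because each extra hex digit
multiplies the required count by 4.

```python
import math

def files_for_collision(hex_digits, probability=0.5):
    """Approximate number of files needed before some pair shares the
    same trailing `hex_digits` of their hash with the given probability."""
    space = 16 ** hex_digits  # number of possible tails
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - probability))))

for digits in (10, 11):
    print(digits, files_for_collision(digits))
```

Either way the answer lands in the millions of same-sized files.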

Or did I read too much into your "correctly" wording, and were you
simply not getting any LFS files detected as renames despite knowing
some of the underlying large files are actually renames?

> probably because Git is not doing a similarity check on the content.
> Doing so would require having the content (instead of just the LFS
> pointer), and that would require running the smudge filter, which
> could take a very long time due to network requests, very expensive if
> done on every file in the repo.

...and it probably wouldn't return meaningful results anyway since the
'L' in "LFS" stands for 'Large' and Git's rename detection only works
for small to medium sized files.  (I mean, it'll technically operate
on large files, at least those smaller than core.bigFileThreshold, but
the quality of its answers degrades linearly with file size until it's
essentially meaningless at about 8MB...with the exception that it does
notice that files with dissimilar sizes can't be renames.)

> When doing a `git log` for example, it
> would need to run the smudge filter on all LFS files in all revisions,
> potentially pulling down all the content from the LFS server, just to
> decide if there were any renames.
>
> I wonder if there has been any thought given to whether a similarity
> index can be pre-computed somewhere? (Maybe upon commit with each of
> the commit's ancestors.)

Computing for each commit relative to each of its ancestors would only
help with 'git log'; it wouldn't help with
diff/merge/rebase/revert/etc.  If you want rename handling to work
in general for these files, you'd have to precompute it for each
commit with each other commit in history.  And for it to be reliable,
you'd have to update it with every new commit.  That'd make `git
commit` take time relative to the number of commits in history.  It'd
make `git pull` and such take an amount of time relative to the number
of commits in history multiplied by the number of commits being
downloaded.  And, of course, that "relative" consideration in both
those places includes a factor scaled by the average sizes of the
large files being compared, so the "constant" is pretty big too.
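
To put rough numbers on that scaling, here is a small sketch that
counts how many commit pairs would need a fresh similarity entry under
the "every commit against every other commit" scheme.  The function
name and the example history sizes are illustrative assumptions, and
the per-pair cost of comparing the large files is ignored entirely.

```python
def new_pairs(existing, incoming):
    """Commit pairs needing a similarity entry when `incoming` commits
    are added to a history that already has `existing` commits."""
    # Every incoming commit is paired with every existing commit...
    cross = existing * incoming
    # ...and with every other incoming commit.
    among_new = incoming * (incoming - 1) // 2
    return cross + among_new

print(new_pairs(100_000, 1))    # a single `git commit`
print(new_pairs(100_000, 500))  # a `git pull` that brings in 500 commits
```

The first case grows with the length of history, and the second with
history length times the number of commits fetched.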

> Or if this limitation has been discussed
> before here.

I did a quick search and didn't see anything.  However, I have thought
about the idea of precomputing similarities in general before, but
realized that it ends up just shifting performance problems elsewhere
and quite likely making them worse overall.  So, I think you'd need a
different solution of some sort here if you want rename detection for
LFS files.