On Sun, Jul 24, 2022 at 7:29 AM Philip <philip.c.peterson@xxxxxxxxx> wrote:
>
> Hello all,
>
> I noticed that Git LFS-tracked files cannot correctly detect renames,

"correctly"?  I was expecting you to say that it just wouldn't detect
renames at all for LFS files.  Does the wording of your question imply
that Git is detecting LFS files as renames that aren't actually
renames?

You got me curious.  It appears the LFS pointer files are three lines
long (unless I'm misreading the spec; I have virtually no experience
with LFS).  The first line appears to always be the same for practical
purposes, the second line holds the hash, and the third line holds the
real file length.  What counts as far as rename detection isn't the
number of matching lines, though, but the number of matching
characters from matching lines.  So if one pointer file had e.g.

version https://git-lfs.github.com/spec/v1
oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
size 12345

and you have another that looks like

version https://git-lfs.github.com/spec/v1
oid sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa24d17e2393
size 12345

then, since git splits lines into 64-byte chunks, our "matching lines"
are:

version https://git-lfs.github.com/spec/v1
a24d17e2393
size 12345

(for a total of 66 bytes, including newline characters), and the
"unmatching lines" are

oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2c
oid sha256:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

(neither of which has a trailing newline; each is exactly 64 bytes)

The longer of these two pointer files is 130 bytes (actually, both
files are 130 bytes).  Since the number of matching bytes from
matching lines is 66, we check that 66/130.0 > 0.5, so Git will
consider this a rename using the default similarity threshold of 50%.

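As a rough check of that arithmetic, here is a minimal Python sketch.
It assumes a simplified model of the chunking described above (a chunk
ends at a newline or after 64 bytes, and matching bytes are counted
over identical chunks).  The helper names chunks, matching_bytes and
pointer are made up for illustration; this is not Git's actual
implementation.

```python
from collections import Counter


def chunks(data: bytes, limit: int = 64) -> list[bytes]:
    """Split data into chunks that end at a newline or after `limit` bytes."""
    out, start = [], 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == limit:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing bytes without a newline
    return out


def matching_bytes(a: bytes, b: bytes) -> int:
    """Count bytes that sit in chunks present in both inputs."""
    common = Counter(chunks(a)) & Counter(chunks(b))
    return sum(len(chunk) * count for chunk, count in common.items())


def pointer(oid: str, size: int) -> bytes:
    """Build a three-line LFS pointer file like the ones quoted above."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {size}\n"
    ).encode()


old = pointer(
    "4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393", 12345
)
new = pointer("a" * 54 + "24d17e2393", 12345)

matched = matching_bytes(old, new)
largest = max(len(old), len(new))
similarity = matched / largest
print(matched, largest, round(similarity, 3))
```

Under these assumptions the two pointer files share 66 of their 130
bytes, which is just over the 50% default threshold.
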
So, for LFS files to be detected as a rename, the real file size has
to be at least 100 bytes (so that the real file size has at least
three digits), the real files have to be the same size (e.g. in our
example they were both 12345 bytes), and the last 11 hex digits of
their sha256 must match (those are the digits of the oid line that
spill past the 64-byte chunk boundary).  The first condition is
probably trivial to satisfy.  The second is a little unlikely, but
it's pretty easy to imagine random matches.  However, you'd have to
have an awful lot of LFS files in a repository before there's any
realistic chance of meeting the third condition.  Somewhere on the
order of a few million of them (based on birthday paradox
approximations) -- all with the same underlying file size on the third
line.  (Or, if you had lots of different underlying file sizes, then
dramatically more LFS files would be required to get a collision.)
Given that each file is large, that'd have to be a ginormous
repository if fully checked out.

So, is this actually an issue in practice?  (Perhaps there's a risk of
a dedicated disgruntled employee generating trailing hash collisions
just to make git log show misleading rename output for LFS
files...but if so, that seems pretty tame for the amount of effort put
in?)

Or did I read too much into your "correctly" wording, and were you
simply not getting any LFS files detected as renames despite knowing
some of the underlying large files are actually renames?

> probably because Git is not doing a similarity check on the content.
> Doing so would require having the content (instead of just the LFS
> pointer), and that would require running the smudge filter, which
> could take a very long time due to network requests, very expensive if
> done on every file in the repo.

...and it probably wouldn't return meaningful results anyway since the
'L' in "LFS" stands for 'Large' and Git's rename detection only works
for small to medium sized files.
(I mean, it'll technically operate on large files, at least those
smaller than core.bigFileThreshold, but the quality of its answers
degrades linearly with file size until it's essentially meaningless at
about 8MB...with the exception that it does notice that files with
dissimilar sizes can't be renames.)

> When doing a `git log` for example, it
> would need to run the smudge filter on all LFS files in all revisions,
> potentially pulling down all the content from the LFS server, just to
> decide if there were any renames.
>
> I wonder if there has been any thought given to whether a similarity
> index can be pre-computed somewhere? (Maybe upon commit with each of
> the commit's ancestors.)

Computing for each commit relative to each of its ancestors would only
help with 'git log'; it wouldn't help with
diff/merge/rebase/revert/etc.  If you want rename handling to work in
general for these files, you'd have to precompute it for each commit
with each other commit in history.  And for it to be reliable, you'd
have to update it with every new commit.  That'd make `git commit`
take time relative to the number of commits in history.  It'd make
`git pull` and such take an amount of time relative to the number of
commits in history multiplied by the number of commits being
downloaded.  And, of course, that "relative" consideration in both
those places includes a factor scaled by the average sizes of the
large files being compared, so the "constant" is pretty big too.

> Or if this limitation has been discussed
> before here.

I did a quick search and didn't see anything.  However, I have thought
about the idea of precomputing similarities in general before, but
realized that it ends up just shifting performance problems elsewhere
and quite likely making them worse overall.  So, I think you'd need a
different solution of some sort here if you want rename detection for
LFS files.