Hi Philip, Sorry for the late reply; I somehow missed this earlier. On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote: > > Hi Elijah, > > On 11/11/2023 05:46, Elijah Newren wrote: > > * filename similarity is extraordinarily expensive compared to exact > > renames, and if not carefully handled, can sometimes rival the cost of > > file content similarity computations given our spanhash > > representations. > > I've not heard of spanhash representation before. Any references or > further reading? You can find more in diffcore-delta.c, especially the big comment near the top of the file. But here's a short explanation of spanhashes: * Split files into chunks delimited either by LF or 64 bytes, whichever comes first. * Hash every chunk into an integer between 0 and 107926 * Keep a character count for each of those integers as well (thus if a line has N characters, but appears twice in the file, the associated count for that integer will be 2N). * A "spanhash" is the combination of the integer that a chunk (or span) hashes to, plus the count associated with it. * The list/array of spanhashes for a file (i.e. the list/array of integers and character counts) is used to compare one file to another. Now, why do I claim that comparison of filenames can rival cost of file content similarity? Well, in a monorepo of interest, the median sized file is named something close to "modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png" and is 2875 bytes. As a png, all its chunks are probably the full 64 characters, which works out to about 45 chunks (assuming the 64-byte chunks are different from each other). The filename is 79 characters. So, for this case, 45 pairs of integers vs 79 characters. So, the comparison cost is roughly the same order of magnitude. (Yes, creating the spanhashes is a heavy overhead; however, we only initialize it once and then do N comparisons of each spanhash to the other spanhashes. And we'd be doing N comparisons of each filename to other filenames, so the overhead of creating the spanhashes can be overlooked if your merge has enough files modified on both sides of history.) Yes, this particular repository is a case I randomly picked that you can argue is special. But rather than look at the particular example, I think it's interesting to check how the spanhash size vs. filename size scale with repository size. From my experience: (1) I don't think the median-sized file varies all that much between small and big repositories; every time I check a repo the median size seems to be order of a few thousand bytes, regardless of whether the repository I'm looking at is tiny or huge, (2) while small repositories often have much shorter filenames, big repositories often will have filenames even longer than my example; length of filename tends to grow with repository size from deep directory nestings. So, between these two facts, I'd expect the filename comparison costs to grow relative to file content comparison costs, when considering only median-sized files being modified. And since it's common to have merges or rebases or diffs where only approximately-median-sized files are involved, I think this is relevant to look at. Finally, since I already had an example that showed the cost likely roughly comparable for a random repository of interest, and it's not even all that big a repository compared to many out there, I think the combination motivates pretty well my claim that filename similarity costs _could_ rival file content similarity costs if one wasn't careful. I don't have a rigorous proof here. And, in fact, I ended up doing this rough back-of-the-envelope analysis _after_ implementing some filename similarity comparison ideas and seeing performance degrade badly, and wondering why it made such a difference. I don't know if I ever got exact numbers, but I certainly didn't record them. This rough analysis, though, was what made me realize that I needed to be careful with any such added filename comparisons, though, and is why I'm leery of adding more.