Hi Elijah, Many thanks.. personal notes in-line. On 24/12/2023 07:46, Elijah Newren wrote: > Hi Philip, > > Sorry for the late reply; I somehow missed this earlier. > > On Wed, Nov 15, 2023 at 8:51 AM Philip Oakley <philipoakley@iee.email> wrote: >> >> Hi Elijah, >> >> On 11/11/2023 05:46, Elijah Newren wrote: >>> * filename similarity is extraordinarily expensive compared to exact >>> renames, and if not carefully handled, can sometimes rival the cost of >>> file content similarity computations given our spanhash >>> representations. >> >> I've not heard of spanhash representation before. Any references or >> further reading? > > You can find more in diffcore-delta.c, especially the big comment near > the top of the file. +1 > But here's a short explanation of spanhashes: > * Split files into chunks delimited either by LF or 64 bytes, > whichever comes first. neat > * Hash every chunk into an integer between 0 and 107926 as per the comment, this is 1 less than a nice prime 107927 that fits 17bits. Some discussions at https://lore.kernel.org/git/7vwtezt202.fsf@xxxxxxxxxxxxxxxxxxxxxxxx/ and surrounding messages. The hash is very similar to a CRC, a rotating 64bit value, using 7 bit shifts and a 8bit char addition, then reduced to a hash computed at ~#L157 > * Keep a character count for each of those integers as well (thus if > a line has N characters, but appears twice in the file, the associated > count for that integer will be 2N). > * A "spanhash" is the combination of the integer that a chunk (or > span) hashes to, plus the count associated with it. > * The list/array of spanhashes for a file (i.e. the list/array of > integers and character counts) is used to compare one file to another. I was surprised to see that I'd been in the area at #L162 ;-) Thank you for the useful summary. > > Now, why do I claim that comparison of filenames can rival cost of > file content similarity? Well, in a monorepo of interest, the median > sized file is named something close to > "modules/client-resources/src/main/resources/buttons/smallTriangleBlackRight.png" > and is 2875 bytes. As a png, all its chunks are probably the full 64 > characters, which works out to about 45 chunks (assuming the 64-byte > chunks are different from each other). The filename is 79 characters. > So, for this case, 45 pairs of integers vs 79 characters. So, the > comparison cost is roughly the same order of magnitude. > (Yes, creating the spanhashes is a heavy overhead; however, we only > initialize it once and then do N comparisons of each spanhash to the > other spanhashes. And we'd be doing N comparisons of each filename to > other filenames, so the overhead of creating the spanhashes can be > overlooked if your merge has enough files modified on both sides of > history.) Nice point about the hashes only being computed once. > > Yes, this particular repository is a case I randomly picked that you > can argue is special. But rather than look at the particular example, > I think it's interesting to check how the spanhash size vs. filename > size scale with repository size. From my experience: (1) I don't > think the median-sized file varies all that much between small and big > repositories; every time I check a repo the median size seems to be > order of a few thousand bytes, regardless of whether the repository > I'm looking at is tiny or huge, (2) while small repositories often > have much shorter filenames, big repositories often will have > filenames even longer than my example; length of filename tends to > grow with repository size from deep directory nestings. So, between > these two facts, I'd expect the filename comparison costs to grow > relative to file content comparison costs, when considering only > median-sized files being modified. And since it's common to have > merges or rebases or diffs where only approximately-median-sized files > are involved, I think this is relevant to look at. Finally, since I > already had an example that showed the cost likely roughly comparable > for a random repository of interest, and it's not even all that big a > repository compared to many out there, I think the combination > motivates pretty well my claim that filename similarity costs _could_ > rival file content similarity costs if one wasn't careful. > > I don't have a rigorous proof here. And, in fact, I ended up doing > this rough back-of-the-envelope analysis _after_ implementing some > filename similarity comparison ideas and seeing performance degrade > badly, and wondering why it made such a difference. I don't know if I > ever got exact numbers, but I certainly didn't record them. This > rough analysis, though, was what made me realize that I needed to be > careful with any such added filename comparisons, though, and is why > I'm leery of adding more. Thanks again.