This issue arose while I was implementing changed path filters (the bloom filters in the commit graph file) for JGit [1]. It turns out that if a repository contains paths that have at least one character greater than 0x7f, Git writes these filters and interprets them (when read from the commit graph file) differently depending on whether "char" is signed or unsigned, and when "char" is signed (this can be controlled by a compiler flag), the implementation of the murmur3 hash differs from the one in Wikipedia and elsewhere. (I looked into Git's Makefile and didn't see anything that controlled whether "char" is signed, so I presume that the default of the compiler used applies.) This can be seen from the murmur3_seeded() function in bloom.c: > uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) > { [snip] > uint32_t byte1 = (uint32_t)data[4*i]; Note that data[4*i] is of type "char". When I first read this, I thought that no sign extension would occur regardless of the signedness of char (that is, (uint32_t)(signed char)-1 == (uint32_t)(unsigned char)-1) but it turns out that sign extension does occur if the thing being cast is signed (that is, (uint32_t)(signed char)-1 != (uint32_t)(unsigned char)-1) [2]. The implementation of the murmur3 hash in Wikipedia (and presumably the intention of the author here, since the variable is named "byte1") expects this value to not exceed 0xff, but this is not the case if sign extension occurs. I looked if this was discussed at the time this patch was presented [3] but couldn't find anything related. So...how do we proceed? I can see at least 2 ways: - Decide that we're going to stick with the details of the existing implementation and declare that "data" is always interpreted as signed. In that case, I would put "signed" wherever necessary, rename the function to something that is not "murmur3", and change the names of byte1 etc. to indicate that they are not constrained to be less than or equal to 0xff. - Bump the version number to 2 and correct the implementation to match murmur3 (so, "data" is unsigned). Then we would have to think of a transition plan. One possible one might be to always reject version 1 bloom filters, which I'm personally OK with, but it may seem too heavy a cost to some since in the perhaps typical case where a repo has filenames restricted to 0x7f and below, the existing bloom filters are still correct. Of the two, I prefer the latter, but I'm open to suggestions. [1] https://git.eclipse.org/r/c/jgit/jgit/+/201851/2 [2] Verified in practice and also prescribed by the C standard; see e.g. https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf section 6.3.1.3 point 2. [3] https://lore.kernel.org/git/a5aa3415c05ee9bc67a9471445a20c71a9834673.1586192395.git.gitgitgadget@xxxxxxxxx/