On Thu, May 11, 2023 at 03:51:16PM -0700, Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
>
> > So...how do we proceed? I can see at least 2 ways:
> >
> > - Decide that we're going to stick with the details of the existing
> > implementation and declare that "data" is always interpreted as
> > signed. In that case, I would put "signed" wherever necessary,
> > rename the function to something that is not "murmur3", and change
> > the names of byte1 etc. to indicate that they are not constrained
> > to be less than or equal to 0xff.
> >
> > - Bump the version number to 2 and correct the implementation to
> > match murmur3 (so, "data" is unsigned). Then we would have to think
> > of a transition plan. One possible one might be to always reject
> > version 1 bloom filters, which I'm personally OK with, but it may
> > seem too heavy a cost to some since in the perhaps typical case
> > where a repo has filenames restricted to 0x7f and below, the
> > existing bloom filters are still correct.
>
> If path filter hashing were merely advisory, in the sense that if a
> matching data is found, great, the processing goes faster, but if
> not, we would get correct results albeit not so quickly, a third
> option would be to just update the implementation without updating
> the version number. But we may not be so lucky---you must have seen
> a wrong result returned quickly, which is not what we want to see.

Right; from my understanding of Jonathan's message, I believe that we
would get wrong results if the implementation of not-quite-murmur3
were corrected without also updating the on-disk Bloom filters.

We already have the bloom_filter_settings struct, which could also
encode whether or not the data was computed with sign extension. If
we are consistent in computing the hashes when we write Bloom filters
and when we re-compute hashes to query them, I'd think we would be
able to reuse the existing filters.

That would be nice to do if we could. Unfortunately, I don't think
there is an easy way to determine the signed-ness of an existing set
of Bloom filters, nor a guarantee that they all have the same
signed-ness (IOW, the Bloom filters could have been built up across
multiple copies of Git, each with different compiler settings). So
I'm not sure that is a productive direction for us to take.

> But if I recall correctly we made the file format in such a way that
> bumping the version number is cheap in that transition can appear
> seamless. An updated implementation can just be told to _ignore_
> old and possibly incorrect Bloom filters until it gets told to
> recompute, at which time it can write a correct one with a new
> version number. So I would prefer your "Bump the version number and
> ignore the old and possibly wrong data".

Version numbers are easy to increment, just as it is easy to add new
chunks to the chunked file format used by the commit-graph,
multi-pack-index, etc.

However, computing Bloom filters en masse from scratch is fairly
expensive. At GitHub, when we were rolling out Bloom filters to all
repositories a few years ago, I wrote 809e0327f5
(builtin/commit-graph.c: introduce '--max-new-filters=<n>',
2020-09-18) to avoid spending too much time computing new filters
from scratch.

If we do have to re-compute all of the filters from scratch, it may
be worth considering enumerating all of the repository's paths first
and seeing if any of them contain a byte greater than 0x7f (the
point at which the signed and unsigned interpretations diverge). If
none do, we can propagate the existing filters forward without
having to compute new ones.
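
To make that concrete, here is a rough standalone sketch of the check
(the names load_block_signed(), load_block_unsigned(), and
path_filter_reusable() are made up for illustration; this is not the
code in bloom.c). It shows both why bytes above 0x7f diverge when
"data" is read through a plain char pointer on platforms where char
is signed, and the per-path test for whether an existing filter can
be carried forward:

  #include <inttypes.h>
  #include <stdint.h>
  #include <stdio.h>

  /*
   * Reading "data" through a plain char pointer sign-extends bytes
   * >= 0x80 when widening to 32 bits (where char is signed), so the
   * murmur3 input block, and hence the filter, diverges from the
   * proper unsigned interpretation.
   */
  static uint32_t load_block_signed(const char *data)
  {
          return (uint32_t)data[0] |
                 ((uint32_t)data[1] << 8) |
                 ((uint32_t)data[2] << 16) |
                 ((uint32_t)data[3] << 24);
  }

  static uint32_t load_block_unsigned(const unsigned char *data)
  {
          return (uint32_t)data[0] |
                 ((uint32_t)data[1] << 8) |
                 ((uint32_t)data[2] << 16) |
                 ((uint32_t)data[3] << 24);
  }

  /*
   * A v1 filter hashes this path identically under both
   * interpretations iff no byte has its high bit set.
   */
  static int path_filter_reusable(const char *path)
  {
          for (; *path; path++)
                  if (*path & 0x80)
                          return 0;
          return 1;
  }

  int main(void)
  {
          const char ascii[] = "path";            /* all bytes <= 0x7f */
          const char non_ascii[] = "p\xc3\xa9th"; /* contains 0xc3, 0xa9 */

          printf("ascii:     signed=%08" PRIx32 " unsigned=%08" PRIx32
                 " reusable=%d\n",
                 load_block_signed(ascii),
                 load_block_unsigned((const unsigned char *)ascii),
                 path_filter_reusable(ascii));
          printf("non-ascii: signed=%08" PRIx32 " unsigned=%08" PRIx32
                 " reusable=%d\n",
                 load_block_signed(non_ascii),
                 load_block_unsigned((const unsigned char *)non_ascii),
                 path_filter_reusable(non_ascii));
          return 0;
  }

The two loads agree on the first path and disagree on the second,
which is exactly the set of paths whose existing filters we could
keep as-is.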

Better yet, we should be able to reuse existing Bloom filter data for
paths whose bytes are all <=0x7f, and only recompute filters where
necessary. That makes much more sense than the previous paragraph.

Thanks,
Taylor