Taylor Blau <me@xxxxxxxxxxxx> writes: > On Thu, Jul 13, 2023 at 02:42:11PM -0700, Jonathan Tan wrote: > > The murmur3 implementation in bloom.c has a bug when converting series > > of 4 bytes into network-order integers when char is signed (which is > > controllable by a compiler option, and the default signedness of char is > > platform-specific). When a string contains characters with the high bit > > set, this bug causes results that, although internally consistent within > > Git, does not accord with other implementations of murmur3 and even with > > Git binaries that were compiled with different signedness of char. This > > bug affects both how Git writes changed path filters to disk and how Git > > interprets changed path filters on disk. > > I think that you make a worthwhile point that the existing > implementation is internally consistent, but doesn't actually implement > a conventional murmur3. I wonder if it's worth being explicit where you > mention its internal consistency to say that the existing implementation > would never cause us to produce wrong results, but wouldn't be readable > by other off-the-shelf implementations of murmur3. > > (To be clear, I think that you already make this point, I'm just > suggesting that it may be worth spelling it out even more explicitly > than what is written above). OK, I've added some more text describing this. > > Because this bug only manifests with characters that have the high bit > > set, it may be possible that some (or all) commits in a given repo would > > have the same changed path filter both before and after this fix is > > applied. However, in order to determine whether this is the case, the > > changed paths would first have to be computed, at which point it is not > > much more expensive to just compute a new changed path filter. > > Hmm. I think in the general case that is true, but I wonder if there's a > shortcut we could take for trees that are known to not have *any* > characters with their high-order bits set. That is, if we scan both of > the first parent's trees and determine that no such paths exist, the > contents of the Bloom filter would be the same in either version, right? > > I think that that would be faster than recomputing all filters from > scratch. In either case, we have to load the whole tree. But if we can > quickly scan (and cache our results by setting some bit--indicating the > absence of paths with characters having their highest bit set--on the tree > objects' `flags` field), then we should be able to copy forward the > existing representation of the filter. > > I think the early checks would be more expensive, since in the worst > case you have to walk the entire tree, only to realize that you actually > wanted to compute a first-parent tree diff, meaning you have to > essentially repeat the whole walk over again. But for repositories that > have few or no commits whose Bloom filters need computing, I think it > would be significantly faster, since many of the sub-trees wouldn't need > to be visited again. So for repositories that need little-to-no recomputation of Bloom filters, your idea likely means that each tree needs to be read once, as compared to recomputing everything in which, I think, each tree needs to be read roughly twice (once when computing the Bloom filter for the commit that introduces it, and once for the commit that substitutes a different tree in place). I could change the text of the commit message to discuss this (instead of the blanket statement that it would be too hard), although I think that an implementation of this can be done after this patchset. What do you think? > > There is a change in write_commit_graph(). graph_read_bloom_data() > > makes it possible for chunk_bloom_data to be non-NULL but > > bloom_filter_settings to be NULL, which causes a segfault later on. I > > produced such a segfault while developing this patch, but couldn't find > > a way to reproduce it neither after this complete patch (or before), > > but in any case it seemed like a good thing to include that might help > > future patch authors. > > Hmm. Interesting, I'd love to know more about what you were doing that > produced the segfault. I think it would be worth it to prove to > ourselves that this segfault can't occur in the wild. Or if it can, it > would be worth it to understand the bug and prevent it from happening. If I remember correctly, I changed the version in DEFAULT_BLOOM_FILTER_SETTINGS to 2 and ran it to see what broke. This was a while back so I don't remember it exactly, though. > > @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data, > > int i; > > const uint32_t seed0 = 0x293ae76f; > > const uint32_t seed1 = 0x7e646e2c; > > - const uint32_t hash0 = murmur3_seeded(seed0, data, len); > > - const uint32_t hash1 = murmur3_seeded(seed1, data, len); > > + const uint32_t hash0 = (settings->hash_version == 2 > > + ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len); > > + const uint32_t hash1 = (settings->hash_version == 2 > > + ? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len); > > I do admire the ternary operator over the function being called, as I > think that Stolee pointed out earlier in this series. But I worry that > these two checks could fall out of sync with each other, causing us to > pick incosistent values for hash0, and hash1, respectively. > > FWIW, I would probably write this as: > > const uint32_t hash0, hash1; > if (settings->hash_version == 2) { > hash0 = murmur3_seeded_v2(seed0, data, len); > hash1 = murmur3_seeded_v2(seed1, data, len); > } else { > hash0 = murmur3_seeded_v1(seed0, data, len); > hash1 = murmur3_seeded_v1(seed1, data, len); > } > > I suppose that there isn't anything keeping the calls within each arm of > the if-statement above in sync with each other (i.e., I could call > murmur3_seeded_v1() immediately before dispatching a call to > murmur3_seeded_v2()). So in that sense I think that this is no more or > less safe than what's already written. > > But IMHO I think this one reads more cleanly, so I might prefer it over > what you have in this patch. But I don't feel so strongly about it > either way. I'm OK either way, so I'll go with your approach (except the "const", since my compiler doesn't like that). > > @@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start, > > return 0; > > } > > > > +struct graph_read_bloom_data_data { > > + struct commit_graph *g; > > + int *commit_graph_changed_paths_version; > > +}; > > + > > Nit: maybe `graph_read_bloom_data_context`, to avoid repeating "data"? I > don't have strong feelings here, FWIW. I'll use "context". > The rest of the implementation and tests look good to me. > > Thanks, > Taylor Thanks for the review.