Re: [PATCH v5 4/4] commit-graph: new filter ver. that fixes murmur3

Taylor Blau <me@xxxxxxxxxxxx> · Wed, 19 Jul 2023 14:24:44 -0400

On Thu, Jul 13, 2023 at 02:42:11PM -0700, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.

I think that you make a worthwhile point that the existing
implementation is internally consistent, but doesn't actually implement
a conventional murmur3. I wonder if it's worth being explicit where you
mention its internal consistency to say that the existing implementation
would never cause us to produce wrong results, but wouldn't be readable
by other off-the-shelf implementations of murmur3.

(To be clear, I think that you already make this point, I'm just
suggesting that it may be worth spelling it out even more explicitly
than what is written above).

> Therefore, introduce a new version (2) of changed path filters that
> corrects this problem. The existing version (1) is still supported and
> is still the default, but users should migrate away from it as soon
> as possible.

Makes sense.

> Because this bug only manifests with characters that have the high bit
> set, it may be possible that some (or all) commits in a given repo would
> have the same changed path filter both before and after this fix is
> applied. However, in order to determine whether this is the case, the
> changed paths would first have to be computed, at which point it is not
> much more expensive to just compute a new changed path filter.

Hmm. I think in the general case that is true, but I wonder if there's a
shortcut we could take for trees that are known to not have *any*
characters with their high-order bits set. That is, if we scan both of
the first parent's trees and determine that no such paths exist, the
contents of the Bloom filter would be the same in either version, right?

I think that that would be faster than recomputing all filters from
scratch. In either case, we have to load the whole tree. But if we can
quickly scan (and cache our results by setting some bit--indicating the
absence of paths with characters having their highest bit set--on the tree
objects' `flags` field), then we should be able to copy forward the
existing representation of the filter.

I think the early checks would be more expensive, since in the worst
case you have to walk the entire tree, only to realize that you actually
wanted to compute a first-parent tree diff, meaning you have to
essentially repeat the whole walk over again. But for repositories that
have few or no commits whose Bloom filters need computing, I think it
would be significantly faster, since many of the sub-trees wouldn't need
to be visited again.

> There is a change in write_commit_graph(). graph_read_bloom_data()
> makes it possible for chunk_bloom_data to be non-NULL but
> bloom_filter_settings to be NULL, which causes a segfault later on. I
> produced such a segfault while developing this patch, but couldn't find
> a way to reproduce it neither after this complete patch (or before),
> but in any case it seemed like a good thing to include that might help
> future patch authors.

Hmm. Interesting, I'd love to know more about what you were doing that
produced the segfault. I think it would be worth it to prove to
ourselves that this segfault can't occur in the wild. Or if it can, it
would be worth it to understand the bug and prevent it from happening.

> +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
>  {
>  	const uint32_t c1 = 0xcc9e2d51;
>  	const uint32_t c2 = 0x1b873593;
> @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	const uint32_t hash0 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> +	const uint32_t hash1 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);

I do admire the ternary operator over the function being called, as I
think that Stolee pointed out earlier in this series. But I worry that
these two checks could fall out of sync with each other, causing us to
pick incosistent values for hash0, and hash1, respectively.

FWIW, I would probably write this as:

    const uint32_t hash0, hash1;
    if (settings->hash_version == 2) {
        hash0 = murmur3_seeded_v2(seed0, data, len);
        hash1 = murmur3_seeded_v2(seed1, data, len);
    } else {
        hash0 = murmur3_seeded_v1(seed0, data, len);
        hash1 = murmur3_seeded_v1(seed1, data, len);
    }

I suppose that there isn't anything keeping the calls within each arm of
the if-statement above in sync with each other (i.e., I could call
murmur3_seeded_v1() immediately before dispatching a call to
murmur3_seeded_v2()). So in that sense I think that this is no more or
less safe than what's already written.

But IMHO I think this one reads more cleanly, so I might prefer it over
what you have in this patch. But I don't feel so strongly about it
either way.

> diff --git a/commit-graph.c b/commit-graph.c
> index 9b72319450..c50107eed5 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -302,16 +302,25 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start,
>  	return 0;
>  }
>
> +struct graph_read_bloom_data_data {
> +	struct commit_graph *g;
> +	int *commit_graph_changed_paths_version;
> +};
> +

Nit: maybe `graph_read_bloom_data_context`, to avoid repeating "data"? I
don't have strong feelings here, FWIW.

The rest of the implementation and tests look good to me.

Thanks,
Taylor