Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/22/2023 5:48 PM, Jonathan Tan wrote:
> The murmur3 implementation in bloom.c has a bug when converting series
> of 4 bytes into network-order integers when char is signed (which is
> controllable by a compiler option, and the default signedness of char is
> platform-specific). When a string contains characters with the high bit
> set, this bug causes results that, although internally consistent within
> Git, does not accord with other implementations of murmur3 and even with
> Git binaries that were compiled with different signedness of char. This
> bug affects both how Git writes changed path filters to disk and how Git
> interprets changed path filters on disk.
> 
> Therefore, fix this bug. And because changed path filters on disk might
> no longer be compatible, teach Git to write "2" as the version when
> writing changed path filters (instead of "1" currently), and only accept
> "2" as the version when reading them (instead of "1" currently).

I appreciate that you discovered and are presenting a way out of this
problem, however the current approach does not preserve compatibility
enough.

> @@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len)
>  
>  	uint32_t k;
>  	for (i = 0; i < len4; i++) {
> -		uint32_t byte1 = (uint32_t)data[4*i];
> -		uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8;
> -		uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16;
> -		uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24;
> +		uint32_t byte1 = (uint32_t)(unsigned char)data[4*i];
> +		uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8;
> +		uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16;
> +		uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24;
>  		k = byte1 | byte2 | byte3 | byte4;
>  		k *= c1;
>  		k = rotate_left(k, r1);

By changing this algorithm directly (instead of making an "unsigned" version,
or renaming this one to the "maybe signed" version), you are making it
impossible for us to ship a version that can read version 1 Bloom filters,
so all read-only history operations will immediately slow down (because they
will ignore v1 chunks, better than incorrectly parsing v1 chunks).

> @@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start,
>  	g->chunk_bloom_data = chunk_start;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (hash_version != 1)
> +	if (hash_version != BLOOM_HASH_VERSION)
>  		return 0;
  
Here's where we would ignore v1 filters, instead of continuing to read them
(with all the risks involved).

In order for this to be something we can ship safely to environments that depend
on changed-path Bloom filters, we need to be able to parse v1 filters. It would
be even better if we didn't write v2 filters by default, but instead hid it
behind a config option that is off by default for at least one major release.

I notice that you didn't update the commit-graph format docs, which seems like
a valuable place to describe the new version number, as well as any plans to
completely deprecate v1. For instance, describing the v1 implementation as
having inconsistent application of murmur3 is a valuable thing to have, but
then describe the plans for deprecating it as an unsafe format.

Here is a potential plan to consider:

 1. v2.42.0 includes writing v2 format, off by default.
 2. v2.43.0 writes v2 format by default.
 3. v2.44.0 no longer parses v1 format (ignored without error).

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux