Re: [PATCH v4 4/4] commit-graph: new filter ver. that fixes murmur3

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 6/13/2023 1:39 PM, Jonathan Tan wrote:

>  commitGraph.changedPathsVersion::
>  	Specifies the version of the changed-path Bloom filters that Git will read and
> -	write. May be 0 or 1. Any changed-path Bloom filters on disk that do not
> +	write. May be 0, 1, or 2. Any changed-path Bloom filters on disk that do not
>  	match the version set in this config variable will be ignored.
>  +
>  Defaults to 1.

Is this a good place to document the planned modification of this default in
future versions of Git?

> +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len)
>  {
>  	const uint32_t c1 = 0xcc9e2d51;
>  	const uint32_t c2 = 0x1b873593;
> @@ -130,8 +187,10 @@ void fill_bloom_key(const char *data,
>  	int i;
>  	const uint32_t seed0 = 0x293ae76f;
>  	const uint32_t seed1 = 0x7e646e2c;
> -	const uint32_t hash0 = murmur3_seeded(seed0, data, len);
> -	const uint32_t hash1 = murmur3_seeded(seed1, data, len);
> +	const uint32_t hash0 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed0, data, len);
> +	const uint32_t hash1 = (settings->hash_version == 2
> +		? murmur3_seeded_v2 : murmur3_seeded_v1)(seed1, data, len);

This is the critical step. I've not seen this ternary trick within
a function name choice, but it makes for a compact check. Nice.

However, I think the 'settings->hash_version' is the wrong place
to look for the condition. We should be getting this value from the
commit-graph we are reading. (More on this later.)

>  struct bloom_filter_settings {
>  	/*
>  	 * The version of the hashing technique being used.
> -	 * We currently only support version = 1 which is
> +	 * The newest version is 2, which is
>  	 * the seeded murmur3 hashing technique implemented
> -	 * in bloom.c.
> +	 * in bloom.c. Bloom filters of version 1 were created
> +	 * with prior versions of Git, which had a bug in the
> +	 * implementation of the hash function.

...

> +struct graph_read_bloom_data_data {
> +	struct commit_graph *g;
> +	int commit_graph_changed_paths_version;
> +};
> +
>  static int graph_read_bloom_data(const unsigned char *chunk_start,
>  				  size_t chunk_size, void *data)
>  {
> -	struct commit_graph *g = data;
> +	struct graph_read_bloom_data_data *d = data;
> +	struct commit_graph *g = d->g;
>  	uint32_t hash_version;
>  	g->chunk_bloom_data = chunk_start;
>  	hash_version = get_be32(chunk_start);
>  
> -	if (hash_version != 1)
> +	if (hash_version != d->commit_graph_changed_paths_version)
>  		return 0;

This makes it appear like we cannot read a commit-graph that has
a Bloom filter version that doesn't match the configured version.

This seems incorrect. If we want to configure to _write_ v2, we
should still be able to _read_ v1 concurrently until those v2
filters are written.

This check should be:

	if (hash_version <= 0 || hash_version > 2)
		return 0;

and then 

	g->filter_hash_version = hash_version;

to store this hash version somewhere in the graph. This way, we
can read any commit-graph file and will not suffer performance
problems in the time between setting the config value and writing
the new commit-graph file.

> -	if (s->commit_graph_changed_paths_version == 1) {
> +	if (s->commit_graph_changed_paths_version == 1
> +	    || s->commit_graph_changed_paths_version == 2) {

Perhaps this could just be

	if (s->commit_graph_changed_paths_version) {

to say "not zero means we still read the filters". Though, since
this config _should_ mean "which version do we _write_?" it might
be good to go back on the "unifying the two config options".

> +		struct graph_read_bloom_data_data data = {
> +			.g = graph,
> +			.commit_graph_changed_paths_version = s->commit_graph_changed_paths_version
> +		};
>  		pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES,
>  			   &graph->chunk_bloom_indexes);
>  		read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA,
> -			   graph_read_bloom_data, graph);
> +			   graph_read_bloom_data, &data);
>  	}

Much of this block will not be necessary as we don't need to
send the repo settings into the read_chunk() method.

> +	if (r->settings.commit_graph_changed_paths_version < 0
> +	    || r->settings.commit_graph_changed_paths_version > 2) {
> +		warning(_("attempting to write a commit-graph, but 'commitgraph.changedPathsVersion' (%d) is not supported"),
> +			r->settings.commit_graph_changed_paths_version);
> +		return 0;

I see the "< 0" means we aren't considering the case of disabling 
writes with the zero value. We should exit early if we see a zero
here, for extra safety.

> +	}
> +	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2
> +		? 2 : 1;

Once we've checked that this value is not zero, we can do this:

	bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version;


>  		/* We have changed-paths already. Keep them in the next graph */
> -		if (g && g->chunk_bloom_data) {
> +		if (g && g->bloom_filter_settings) {
>  			ctx->changed_paths = 1;
>  			ctx->bloom_settings = g->bloom_filter_settings;
>  		}


> --- a/t/t4216-log-bloom.sh
> +++ b/t/t4216-log-bloom.sh
> @@ -450,4 +450,35 @@ test_expect_success 'version 1 changed-path used when version 1 requested' '
>  		test_bloom_filters_used "-- $CENT")
>  '
>  
> +test_expect_success 'version 1 changed-path not used when version 2 requested' '
> +	(cd highbit1 &&
> +		git config --add commitgraph.changedPathsVersion 2 &&
> +		test_bloom_filters_not_used "-- $CENT")
> +'

I think this test is the wrong behavior. We should be able to use version 1
when version 2 is requested.

Instead, start with version 1, and _upgrade_ to version 2 and then check
which version exists in the file.

We should only _not_ use the filters if the version is 0 (or
commitGraph.readChangedPaths=false). I think this might be a good enough reason
to 

> +
> +test_expect_success 'set up repo with high bit path, version 2 changed-path' '
> +	git init highbit2 &&
> +	git -C highbit2 config --add commitgraph.changedPathsVersion 2 &&
> +	test_commit -C highbit2 c2 "$CENT" &&
> +	git -C highbit2 commit-graph write --reachable --changed-paths
> +'
> +
> +test_expect_success 'check value of version 2 changed-path' '
> +	(cd highbit2 &&
> +		printf "c01f" >expect &&
> +		get_first_changed_path_filter >actual &&
> +		test_cmp expect actual)
> +'
> +
> +test_expect_success 'version 2 changed-path used when version 2 requested' '
> +	(cd highbit2 &&
> +		test_bloom_filters_used "-- $CENT")
> +'
> +
> +test_expect_success 'version 2 changed-path not used when version 1 requested' '
> +	(cd highbit2 &&
> +		git config --add commitgraph.changedPathsVersion 1 &&
> +		test_bloom_filters_not_used "-- $CENT")
> +'

Again, this is also the wrong situation.

Thanks,
-Stolee



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux