Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two

jnareb@xxxxxxxxx (Jakub Narębski) · Sat, 05 Sep 2020 19:22:08 +0200

Taylor Blau <me@xxxxxxxxxxxx> writes:

[...]
> While we're at it, instrument the new 'get_or_compute_bloom_filter()'
> with two counters in the 'write_commit_graph_context' struct which store
> the number of filters that we computed, and the number of those which
> were too large to store.

[...]
> @@ -1414,12 +1433,25 @@ static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>  		QSORT(sorted_commits, ctx->commits.nr, commit_gen_cmp);
>  
>  	for (i = 0; i < ctx->commits.nr; i++) {
> +		int computed = 0;
>  		struct commit *c = sorted_commits[i];
> -		struct bloom_filter *filter = get_bloom_filter(ctx->r, c, 1);
> +		struct bloom_filter *filter = get_or_compute_bloom_filter(
> +			ctx->r,
> +			c,
> +			1,
> +			&computed);
> +		if (computed) {
> +			ctx->count_bloom_filter_computed++;
> +			if (filter && !filter->len)
> +				ctx->count_bloom_filter_found_large++;

How do you distinguish between no changed paths stored because there
were no changes (which should not count as *_found_large), and no
changed paths stored because there were too many changes?  If I remember
it correctly in current implemetation both are represented as
zero-length filter (no changed paths could have been represented as all
zeros filter, too many changed paths could have been represented as all
ones filter).

No changes to store in filter can happen not only with `--allow-empty`
(e.g. via interactive rebase), but also with merge where all changes
came from the second parent -- we are storing only changes to first
parent, if I remember it correctly.

This is a minor issue, though.

> +		}
>  		ctx->total_bloom_filter_data_size += sizeof(unsigned char) * filter->len;
>  		display_progress(progress, i + 1);
>  	}
>  
> +	if (trace2_is_enabled())
> +		trace2_bloom_filter_write_statistics(ctx);
> +
>  	free(sorted_commits);
>  	stop_progress(&progress);
>  }

Best,
-- 
Jakub Narębski