Re: [PATCH v3 12/14] commit-graph: add large-filters bitmap chunk

Derrick Stolee <stolee@xxxxxxxxx> · Tue, 11 Aug 2020 17:11:49 -0400

On 8/11/20 4:52 PM, Taylor Blau wrote:
> To allow using the existing bitmap code with 64-bit words, we write the
> data in network byte order from the 64-bit words. This means we also
> need to read the array from the commit-graph file by translating each
> word from network byte order using get_be64() when loading the commit
> graph. (Note that this *could* be delayed until first-use, but a later
> patch will rely on this being initialized early, so we assume the
> up-front cost when parsing instead of delaying initialization).

I don't think this is correct to do. This means that every commit-graph
load will load the entire large bloom filter chunk into memory before
allowing a single commit to be read from the graph.

In the case of a very large commit-graph (> 1 million commits), this
would cause a significant initial cost that is not necessarily needed
for anything. For example, "git log -1" will be delayed by this action.

If I understand correctly, this bloom_large bitmap is only used when
writing Bloom filters. At that point, reading the entire bitmap from
disk into memory is inexpensive compared to the time saved by the
feature.

> @@ -429,6 +430,24 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  				graph->bloom_filter_settings->bits_per_entry = get_be32(data + chunk_offset + 8);
>  			}
>  			break;
> +
> +		case GRAPH_CHUNKID_BLOOMLARGE:
> +			if (graph->chunk_bloom_large_filters)
> +				chunk_repeated = 1;
> +			else if (r->settings.commit_graph_read_changed_paths) {

This is guarded against commitGraph.readChangedPaths, which is good,
but that's not enough to claim that we need the bloom_large bitmap
in this process.

> +				size_t alloc = get_be64(chunk_lookup + 4) - chunk_offset - sizeof(uint32_t);

If we store this inside the commit_graph struct, we can save this
size for later so...

> +				graph->chunk_bloom_large_filters = data + chunk_offset + sizeof(uint32_t);
> +				graph->bloom_filter_settings->max_changed_paths = get_be32(data + chunk_offset);

...this portion can be done only when we are about to read from the
bitmap.

> +				if (alloc) {
> +					size_t j;
> +					graph->bloom_large = bitmap_word_alloc(alloc);
> +
> +					for (j = 0; j < graph->bloom_large->word_alloc; j++)
> +						graph->bloom_large->words[j] = get_be64(
> +							graph->chunk_bloom_large_filters + j * sizeof(eword_t));
> +				}

> +			}
> +			break;
>  		}
>  
>  		if (chunk_repeated) {
> @@ -443,7 +462,9 @@ struct commit_graph *parse_commit_graph(struct repository *r,
>  		/* We need both the bloom chunks to exist together. Else ignore the data */
>  		graph->chunk_bloom_indexes = NULL;
>  		graph->chunk_bloom_data = NULL;
> +		graph->chunk_bloom_large_filters = NULL;
>  		FREE_AND_NULL(graph->bloom_filter_settings);
> +		bitmap_free(graph->bloom_large);

Perhaps this bitmap_free() needs a check to see if we
initialized it (with my recommended lazy-loading), but
maybe not?

>  	}
>  
>  	hashcpy(graph->oid.hash, graph->data + graph->data_len - graph->hash_len);
> @@ -932,6 +953,20 @@ struct tree *get_commit_tree_in_graph(struct repository *r, const struct commit
>  	return get_commit_tree_in_graph_one(r, r->objects->commit_graph, c);
>  }
>  
> +static int get_bloom_filter_large_in_graph(struct commit_graph *g,
> +					   const struct commit *c)
> +{
> +	uint32_t graph_pos = commit_graph_position(c);
> +	if (graph_pos == COMMIT_NOT_FROM_GRAPH)
> +		return 0;
> +
> +	while (g && graph_pos < g->num_commits_in_base)
> +		g = g->base_graph;
> +
> +	if (!(g && g->bloom_large))
> +		return 0;

Here is where we can check if the size of the chunk is
non-zero and if the g->bloom_large bitmap is uninitialized.
Since we are caring about this information, it is now time
to do the network-byte transition of the full bitmap.

> +	return bitmap_get(g->bloom_large, graph_pos - g->num_commits_in_base);
> +}
>  
Thanks,
-Stolee