Garima Singh <garimasigit@xxxxxxxxx> writes:
> On 2/17/2020 4:56 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

[...]

>>> ---
>>>  commit-graph.c | 32 +++++++++++++++++++++++++++++++-
>>>  commit-graph.h | 3 ++-
>>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>
>> It would be good to have at least a sanity check of this feature,
>> perhaps one that would check that the number of per-commit Bloom
>> filters on the slab matches the number of commits in the commit-graph.
>
> The combination of all the e2e tests in this series with the test
> flag being turned on in the CI, and the performance gains we are
> seeing, confirm that this is happening correctly.

Well, the advantage of unit tests over e2e functional tests is that
they can pinpoint the source of a bug much more easily. That said,
I don't think there is an absolute need for unit tests here, though
it would be nice to have them.

>>>
>>>  	const struct split_commit_graph_opts *split_opts;
>>> +	uint32_t total_bloom_filter_data_size;
>>
>> This is the total size of the Bloom filter data, in bytes, which will
>> later be used as the BDAT chunk size. However, the commit-graph format
>> uses 8 bytes for a byte offset, not 4. Why is it uint32_t and not
>> uint64_t, then?
>
> Changed to size_t. Thanks for noticing!

Right, this is a local value (size_t may be a different size on
different architectures), even though it will be stored indirectly in
the chunk lookup table as a pair of uint64_t offsets.

>>>  };
>>>
>>>  static void write_graph_chunk_fanout(struct hashfile *f,
>>> @@ -1140,6 +1143,28 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  	stop_progress(&ctx->progress);
>>>  }
>>>
>>> +static void compute_bloom_filters(struct write_commit_graph_context *ctx)
>>> +{
>>> +	int i;
>>> +	struct progress *progress = NULL;
>>> +
>>> +	load_bloom_filters();
>>> +
>>> +	if (ctx->report_progress)
>>> +		progress = start_progress(
>>> +			_("Computing commit diff Bloom filters"),
>>> +			ctx->commits.nr);
>>> +
>>
>> Shouldn't we initialize ctx->total_bloom_filter_data_size to 0 here?
>> We cannot use compute_bloom_filters() to _update_ Bloom filter data,
>> I think -- we don't distinguish here between new and existing data
>> (where the existing data size would already be included in the total
>> Bloom filter size). At least I don't think so.
>
> This line in commit-graph.c takes care of reinitializing the graph
> context and, by consequence, the Bloom filter data size:
>
>     ctx = xcalloc(1, sizeof(struct write_commit_graph_context));
>
> So the total size gets recalculated every time, which is correct.

True, I had missed this.

Best,
-- 
Jakub Narębski
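
P.S. To make the offset point above concrete: the chunk lookup table of
the commit-graph file stores 8-byte file offsets, so the in-memory
size_t accumulator only ever reaches the disk after being widened to
uint64_t. A throwaway sketch follows; the function and variable names
are made up for illustration and are not the code from the patch: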
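
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Illustration only: the BDAT chunk size is never written to disk as a
 * 4-byte value; it is implied by the difference between two 8-byte
 * offsets in the chunk lookup table.
 */
static uint64_t next_chunk_offset(uint64_t chunk_start,
				  size_t total_bloom_filter_data_size)
{
	/* the size_t accumulator is widened to the on-disk offset type here */
	return chunk_start + (uint64_t)total_bloom_filter_data_size;
}

int main(void)
{
	size_t bdat_size = 12345;      /* pretend accumulated BDAT size */
	uint64_t bdat_start = 0x10000; /* pretend start of the BDAT chunk */

	printf("next chunk would start at offset %llu\n",
	       (unsigned long long)next_chunk_offset(bdat_start, bdat_size));
	return 0;
}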
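
And to spell out the xcalloc() point: git's xcalloc() is a calloc()
wrapper that dies on allocation failure, and calloc() returns zeroed
memory, so every commit-graph write starts with
total_bloom_filter_data_size == 0 even without an explicit reset in
compute_bloom_filters(). Below is a stripped-down, stand-alone
illustration; struct write_ctx is a stand-in for the real
write_commit_graph_context, reduced to the one field under discussion: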
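
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* stand-in for struct write_commit_graph_context, reduced to one field */
struct write_ctx {
	size_t total_bloom_filter_data_size;
};

int main(void)
{
	/* calloc(), like git's xcalloc() wrapper, returns zeroed memory */
	struct write_ctx *ctx = calloc(1, sizeof(*ctx));

	if (!ctx)
		return 1;

	/* so the accumulator starts from 0 on every write */
	assert(ctx->total_bloom_filter_data_size == 0);

	free(ctx);
	return 0;
}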