Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand

Garima Singh <garimasigit@xxxxxxxxx> · Mon, 24 Feb 2020 16:51:48 -0500

On 2/20/2020 3:28 PM, Jakub Narebski wrote:
> "Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
>> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>
>>
>> Add --changed-paths option to git commit-graph write. This option will
>> allow users to compute information about the paths that have changed
>> between a commit and its first parent, and write it into the commit graph
>> file. If the option is passed to the write subcommand we set the
>> COMMIT_GRAPH_WRITE_BLOOM_FILTERS flag and pass it down to the
>> commit-graph logic.
> 
> In the manpage you write that this operation (computing Bloom filters)
> can take a while on large repositories.  Could you perhaps provide some
> numbers: how much longer does it take to write commit-graph file with
> and without '--changed-paths' for example for Linux kernel, or some
> other large repository?  Thanks in advance.
> 

Yes. Will include numbers as appropriate in v3. 

>>
>> Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>> Signed-off-by: Garima Singh <garima.singh@xxxxxxxxxxxxx>
>> ---
>>  Documentation/git-commit-graph.txt | 5 +++++
>>  builtin/commit-graph.c             | 9 +++++++--
>>  2 files changed, 12 insertions(+), 2 deletions(-)
> 
> What is missing is some sanity tests: that bloom index and bloom data
> chunks are not present without '--changed-paths', and that they are
> added with '--changed-paths'.
> 
> If possible, maybe also check in a separate test that the size of
> bloom_index chunk agrees with the number of commits in the commit graph.
> 
> 
> Also, we can now add those tests I have wrote about in my review of
> previous patch, that is:
> 
> 1. If you write commit-graph with --changed-paths, and either add some
>    commits later or exclude some commits from the commit graph, then:
> 
>    a.) commit(s) in commit-graph have Bloom filter
>    b.) commit(s) not in commit-graph do not have Bloom filter
> 
> 2. If you write commit-graph without --changed-paths as base layer,
>    and then write next layer with --changed-paths and --split, then:
> 
>    a.) commit(s) in top layer have Bloom filter(s)
>    b.) commit(s) in bottom layer don't have Bloom filter(s)
> 

I will see what more can be done here. 

>>
>> diff --git a/Documentation/git-commit-graph.txt b/Documentation/git-commit-graph.txt
>> index bcd85c1976..907d703b30 100644
>> --- a/Documentation/git-commit-graph.txt
>> +++ b/Documentation/git-commit-graph.txt
>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>  With the `--append` option, include all commits that are present in the
>>  existing commit-graph file.
>>  +
>> +With the `--changed-paths` option, compute and write information about the
>> +paths changed between a commit and it's first parent. This operation can
>> +take a while on large repositories. It provides significant performance gains
>> +for getting history of a directory or a file with `git log -- <path>`.
>> ++
> 
> Should we write about limitation that the topmost layer in the split
> commit graph needs to be written with '--changed-paths' for Git to use
> this information?  Or perhaps we should try (in the future) to remove
> this limitation??
> 

Given that this information is going to be used best effort, it would be 
superfluous to describe every case and conditional that decides whether 
this information is being used.
>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>  	if (opts.progress)
>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>> +	if (opts.enable_changed_paths)
>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>  
>>  	read_replace_refs = 0;
> 
> All right.  This actually turns on calculation Bloom filters for changed
> paths, thanks to
> 
>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
> 
> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
> filters for changed paths" patch.
> 
> Though... should this enabling be split into two separate patches like
> this?
> 

The idea is that in 4/11 We compute only if the flag is set. 
And between that patch and this one: we prepare the foundational code 
that is now ready for that flag to be set via an opt-in by the user. 

> 
> Best,
>