Re: [PATCH v2 09/11] commit-graph: add --changed-paths option to write subcommand

Jakub Narebski <jnareb@xxxxxxxxx> · Tue, 25 Feb 2020 13:10:41 +0100

Garima Singh <garimasigit@xxxxxxxxx> writes:
> On 2/20/2020 3:28 PM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

[...]
>>> --- a/Documentation/git-commit-graph.txt
>>> +++ b/Documentation/git-commit-graph.txt
>>> @@ -54,6 +54,11 @@ or `--stdin-packs`.)
>>>  With the `--append` option, include all commits that are present in the
>>>  existing commit-graph file.
>>>  +
>>> +With the `--changed-paths` option, compute and write information about the
>>> +paths changed between a commit and it's first parent. This operation can
>>> +take a while on large repositories. It provides significant performance gains
>>> +for getting history of a directory or a file with `git log -- <path>`.
>>> ++
>> 
>> Should we write about limitation that the topmost layer in the split
>> commit graph needs to be written with '--changed-paths' for Git to use
>> this information?  Or perhaps we should try (in the future) to remove
>> this limitation?
>
> Given that this information is going to be used best effort, it would be 
> superfluous to describe every case and conditional that decides whether 
> this information is being used.

I can somewhat agree with this reasoning.

However what I would like to avoid is surprising users.  If one creates
base commit-graph with Bloom filters data, but then when creating
new layer of commit-graph (updating it incrementally), it may be
surprising that `git log -- <path>` is now much slower.

On the other hand if one would update commit-graph in a non-incremental
way (rewriting the commit-graph file), loosing the Bloom filter
information and performance of `git log -- <path>` because one forgot to
include `--changed-paths` is not that unexpected.

Anyway, in the future when this mechanism will be controlled by
appropriate config variable, this whole discussion would become somewhat
moot.

Thought for the future: perhaps `git commit-graph verify` could detect
that split graph has Bloom filters only for some layers, and inform the
user?  But that is almost certainly out of scope of this patch series.

>>> @@ -143,6 +146,8 @@ static int graph_write(int argc, const char **argv)
>>>  		flags |= COMMIT_GRAPH_WRITE_SPLIT;
>>>  	if (opts.progress)
>>>  		flags |= COMMIT_GRAPH_WRITE_PROGRESS;
>>> +	if (opts.enable_changed_paths)
>>> +		flags |= COMMIT_GRAPH_WRITE_BLOOM_FILTERS;
>>>  
>>>  	read_replace_refs = 0;
>> 
>> All right.  This actually turns on calculation Bloom filters for changed
>> paths, thanks to
>> 
>>  	ctx->changed_paths = flags & COMMIT_GRAPH_WRITE_BLOOM_FILTERS ? 1 : 0;
>> 
>> that was added by the "[PATCH v2 04/11] commit-graph: compute Bloom
>> filters for changed paths" patch.
>> 
>> Though... should this enabling be split into two separate patches like
>> this?
>
> The idea is that in 4/11 We compute only if the flag is set. 
> And between that patch and this one: we prepare the foundational code 
> that is now ready for that flag to be set via an opt-in by the user. 

All right.

Choosing how to split large change into series is not easy.  One one
hand one would want for each change to be small and self contained.  On
the other hand it would be good if each change was testable (test-tool
can help here).

Best,
-- 
Jakub Narębski