Making split commit graphs pick up new options (namely --changed-paths)

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 10 Jun 2021 12:40:33 +0200

On Wed, Sep 16 2020, Taylor Blau wrote:

Replying to
http://lore.kernel.org/git/ccb6482feb8d8606d82b5ab97e33184f26d6c5b6.1600279373.git.me@xxxxxxxxxxxx
as a starting point for discussion:

> Introduce a command-line flag to specify the maximum number of new Bloom
> filters that a 'git commit-graph write' is willing to compute from
> scratch.
>
> Prior to this patch, a commit-graph write with '--changed-paths' would
> compute Bloom filters for all selected commits which haven't already
> been computed (i.e., by a previous commit-graph write with '--split'
> such that a roll-up or replacement is performed).
>
> This behavior can cause prohibitively-long commit-graph writes for a
> variety of reasons:
>
>   * There may be lots of filters whose diffs take a long time to
>     generate (for example, they have close to the maximum number of
>     changes, diffing itself takes a long time, etc).
>
>   * Old-style commit-graphs (which encode filters with too many entries
>     as not having been computed at all) cause us to waste time
>     recomputing filters that appear to have not been computed only to
>     discover that they are too-large.
>
> This can make the upper-bound of the time it takes for 'git commit-graph
> write --changed-paths' to be rather unpredictable.
>
> To make this command behave more predictably, introduce
> '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
> from scratch. This lets "computing" already-known filters proceed
> quickly, while bounding the number of slow tasks that Git is willing to
> do.
> [...]
> @@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Commits whose filters are not calculated are stored as a
> +length zero Bloom filter.
> ++
> [...]

Is there any way, with an existing --split setup that later adds
--changed-paths, to make the "add Bloom filters to the graph" step
eventually consistent? Or is a one-off --split=replace the only way to
grandfather in such a feature?

Reading the code, there seems to be no way to do that, even though we
have "chunk_bloom_data" in the graph as well as "bloom_filter_settings".

I'd expect some way to combine the "max_new_filters" and --split with
some eventual-consistency logic so that graphs not matching our current
settings are replaced, or replaced some <limit> at a time.
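
To make that concrete, here is a rough sketch in Python of the
behavior I have in mind. It is not Git's code: backfill_step() and
writes_until_consistent() are hypothetical names, and the model assumes
that each write only pays for filters it computes from scratch, with
"-1" meaning "no limit" as in the quoted documentation.

```python
def backfill_step(missing, max_new_filters):
    """Model one graph write: compute at most max_new_filters new filters.

    `missing` is a list of commits that still lack a Bloom filter
    (hypothetical bookkeeping, not a Git data structure).
    Returns (computed_now, still_missing).
    """
    if max_new_filters < 0:
        # -1 means "no limit", mirroring the quoted --max-new-filters docs.
        batch = len(missing)
    else:
        batch = min(max_new_filters, len(missing))
    return missing[:batch], missing[batch:]


def writes_until_consistent(missing, max_new_filters):
    """Count how many writes it takes until no commit lacks a filter."""
    if missing and max_new_filters == 0:
        raise ValueError("a budget of 0 never converges")
    writes = 0
    while missing:
        _, missing = backfill_step(missing, max_new_filters)
        writes += 1
    return writes


if __name__ == "__main__":
    commits = [f"commit{i}" for i in range(10)]
    # Ten commits without filters, at most four new filters per write.
    print(writes_until_consistent(commits, 4))
```

With a budget like this every write stays bounded, and the chain still
converges on the new settings after a predictable number of writes.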

Also, am I reading the expire_commit_graphs() logic correctly that we
first write the split graph, and then unlink() things that are too old?
I.e. if you rely on the commit-graph for performance, will this make
things slower until the next graph write?

I expected to find something more gentle there, i.e. marking that file
as obsolete, not making it part of the new chain (replacing it), and
then unlinking only things not part of the current chain of data that
are too old. But perhaps I'm just misreading or misunderstanding the
behavior...
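
For reference, the "gentle" policy I was expecting could be sketched
like this in Python. This is only my mental model, not a description of
what expire_commit_graphs() does; files_to_unlink() and its parameters
are made up for illustration.

```python
def files_to_unlink(graph_files, current_chain, now, expire_after):
    """Pick graph files that are safe to delete under the gentle policy.

    graph_files:   dict mapping file name -> mtime in seconds (hypothetical)
    current_chain: set of file names the new chain still references
    now:           current time in seconds
    expire_after:  minimum age in seconds before a file may be removed
    """
    cutoff = now - expire_after
    return sorted(
        name
        for name, mtime in graph_files.items()
        # Never touch a file that is part of the chain we just wrote,
        # and leave recently written files for a later run.
        if name not in current_chain and mtime < cutoff
    )


if __name__ == "__main__":
    files = {"a.graph": 100, "b.graph": 100, "c.graph": 900}
    print(files_to_unlink(files, {"a.graph"}, now=1000, expire_after=500))
```

Under that policy a reader never loses a layer it still depends on;
only files that are both unreferenced and old get removed.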