On Wed, Sep 16 2020, Taylor Blau wrote:

Replying to
http://lore.kernel.org/git/ccb6482feb8d8606d82b5ab97e33184f26d6c5b6.1600279373.git.me@xxxxxxxxxxxx
as a start-off point for discussion;

> Introduce a command-line flag to specify the maximum number of new Bloom
> filters that a 'git commit-graph write' is willing to compute from
> scratch.
>
> Prior to this patch, a commit-graph write with '--changed-paths' would
> compute Bloom filters for all selected commits which haven't already
> been computed (i.e., by a previous commit-graph write with '--split'
> such that a roll-up or replacement is performed).
>
> This behavior can cause prohibitively-long commit-graph writes for a
> variety of reasons:
>
>   * There may be lots of filters whose diffs take a long time to
>     generate (for example, they have close to the maximum number of
>     changes, diffing itself takes a long time, etc).
>
>   * Old-style commit-graphs (which encode filters with too many entries
>     as not having been computed at all) cause us to waste time
>     recomputing filters that appear to have not been computed only to
>     discover that they are too-large.
>
> This can make the upper-bound of the time it takes for 'git commit-graph
> write --changed-paths' to be rather unpredictable.
>
> To make this command behave more predictably, introduce
> '--max-new-filters=<n>' to allow computing at most '<n>' Bloom filters
> from scratch. This lets "computing" already-known filters proceed
> quickly, while bounding the number of slow tasks that Git is willing to
> do.
> [...]
> @@ -67,6 +67,11 @@ this option is given, future commit-graph writes will automatically assume
>  that this option was intended. Use `--no-changed-paths` to stop storing this
>  data.
>  +
> +With the `--max-new-filters=<n>` option, generate at most `n` new Bloom
> +filters (if `--changed-paths` is specified). If `n` is `-1`, no limit is
> +enforced. Commits whose filters are not calculated are stored as a
> +length zero Bloom filter.
> ++
> [...]

Is there any way, with an existing --split setup that later starts
using --changed-paths, to make the "add Bloom filters to the graph"
step eventually consistent, or is a one-off --split=replace the only
way to grandfather in such a feature?

Reading the code there seems to be no way to do that, and we do have
the "chunk_bloom_data" in the graph, as well as
"bloom_filter_settings". I'd expect some way to combine
"max_new_filters" and --split with some eventual-consistency logic, so
that graphs not matching our current settings are replaced, or
replaced some <limit> at a time.

Also, am I reading the expire_commit_graphs() logic correctly that we
first write the split graph, and then unlink() things that are too
old? I.e. if you rely on the commit-graph to optimize things, will this
make things slower until the next run of writing the graph?

I expected to find something more gentle there, i.e. marking that file
as obsolete, not making it part of the new chain (replacing it), and
then unlinking only things that are both too old and not part of the
current chain of data.

But perhaps I'm just misreading or misunderstanding the behavior...
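
To make the "replaced some <limit> at a time" idea concrete, here is a
rough sketch in Python rather than C. It is not taken from
commit-graph.c: `Layer`, `filters_match` and `lowest_layer_to_rewrite`
are hypothetical names, and it assumes that rewriting a layer also
means rewriting every layer above it, and that one new filter is
needed per commit in a stale layer. It walks the chain from the tip
down and folds in stale layers while the filter budget allows; a
negative budget means no limit, mirroring `-1` in the quoted
documentation.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One file in a split commit-graph chain (hypothetical model)."""
    name: str
    num_commits: int
    filters_match: bool  # has Bloom data written with the current settings


def lowest_layer_to_rewrite(chain, max_new_filters):
    """Pick how far down the chain this write should roll up.

    `chain` is ordered base first, tip last. Returns the index of the
    lowest layer to fold into the new tip, or len(chain) if no stale
    layer fits the budget. A negative budget means "no limit".
    """
    budget = max_new_filters
    cut = len(chain)
    for i in range(len(chain) - 1, -1, -1):  # tip first
        layer = chain[i]
        if layer.filters_match:
            continue  # existing filters can be reused at no cost
        if budget >= 0 and layer.num_commits > budget:
            break  # cannot afford this layer; leave it for a later run
        if budget >= 0:
            budget -= layer.num_commits
        cut = i
    return cut


chain = [
    Layer("base", 1000, False),
    Layer("mid", 50, False),
    Layer("tip", 10, True),
]
print(lowest_layer_to_rewrite(chain, 100))  # rolls up "mid" and "tip"
print(lowest_layer_to_rewrite(chain, -1))   # no limit: whole chain
```

Run repeatedly with the same budget, this would only ever convert
layers it can afford, so a base layer larger than the budget would
never be converted; a real implementation would need to convert part
of a layer per run to actually converge.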