Re: Making split commit graphs pick up new options (namely --changed-paths)

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 10 Jun 2021 20:50:02 -0400

On Fri, Jun 11, 2021 at 01:56:31AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >> Reading the code there seems to be no way to do that, and we have the
> >> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
> >>
> >> I'd expect some way to combine the "max_new_filters" and --split with
> >> some eventual-consistency logic so that graphs not matching our current
> >> settings are replaced, or replaced some <limit> at a time.
> >
> > This is asking about something slightly different, Bloom filter
> > settings rather than the existence of chagned-path Bloom filters
> > themselves. The Bloom settings aren't written to the commit-graph
> > although there has been some discussion about doing this in the past.
> >
> > If we ever did encode the Bloom settings, I imagine that accomplishing a
> > sort of "eventually replace all changed-path Bloom filters with these
> > new settings" would be as simple as considering all filters computed
> > under different settings to be "uncomputed".
>
> Yes, I don't have or know of a use-case for this now.
>
> But perhaps as the commit-graph format gets used more as a generic
> container for generated data about commits a thing like this would be
> useful, i.e. regenerating that part of the graph if the settings are
> different, particularly (as has been said) if it's not easy to discover
> the input setting from the data itself.

I wasn't completely correct about this (as Stolee noted and you ACK
below): we do store *some* settings in the format, but we don't store
the maximum number of entries in each filter. So today we would indicate
"computed, too large" for a commit that touches, say, 513 paths (the
limit is 512). But if you later ran:

  GIT_TEST_BLOOM_SETTINGS_MAX_CHANGED_PATHS=514 \
    git commit-graph write --split=replace --changed-paths

we would propagate the "too large" status forward and not bother to
recompute the filter (even though we could represent it if we did bother
to recompute it).

In the future we should probably include this as part of the filter
settings and recompute all filters in the commit-graph we're writing if
the value differs from the previous commit-graph. We have to recompute
it even if the maximum decreases since we can't ask the filter directly
"how many entries do you have?".

...and we have to assume that the limit is 512 on all commit-graphs that
exist in the wild today which do not explicit include this default
value. (That's not totally correct, of course, because you can set the
GIT_TEST_ variable, but it's close enough that I'd feel comfortable with
that rule in practice).

> So yeah, maybe we can just unlink() them right away, or another way to
> handle the race is that load_commit_graph_chain() could just try again
> from the beginning in such a case, and presumably picking up the fresh
> just-rewritten chain.

I'd probably be in favor of the latter.

Thanks,
Taylor