On Thu, Jun 10 2021, Taylor Blau wrote:

> On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
>> Is there any way with an existing --split setup that introduces a
>> --changed-paths to make the "add bloom filters to the graph" eventually
>> consistent, or is some one-off --split=replace the only way to
>> grandfather in such a feature?
>
> I'm assuming what you mean is "can I introduce changed-path Bloom
> filters into an existing split commit-graph with many layers without
> having to recompute the whole thing at once?" If so, then the answer
> is yes.
>
> Passing --changed-paths causes the commit-graph machinery to compute
> missing Bloom filters for every commit in the graph layer it is
> writing. So, if you do something like:
>
>     git commit-graph write --split --reachable --size-multiple=2 \
>         --changed-paths
>
> (--size-multiple=2 is the default, but I'm including it for clarity),
> then you'll get changed-path Bloom filters for all commits in the new
> layer, including those from any layers which were merged to create
> that layer.
>
> That all still respects `--max-new-filters`, with preference going to
> commits with lower generation numbers before higher ones (unless
> including commits from packs explicitly with --stdin-packs, in which
> case preference is given in pack order; see
> commit-graph.c:commit_pos_cmp() for details).
>
> t4216 shows this for --split=replace, but you could just as easily
> imagine a test like this:
>
>     #!/bin/sh
>
>     rm -fr repo
>     git init repo
>     cd repo
>
>     commit () {
>         >$1
>         git add $1
>         git commit -m "$1"
>     }
>
>     # no changed-path Bloom filter
>     commit missing
>     git commit-graph write --split --reachable --no-changed-paths
>
>     missing="$(git rev-parse HEAD)"
>     ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"
>
>     # >= 2x the size of the previous layer, so they will be merged
>     commit bloom1
>     commit bloom2
>     git commit-graph write --split --reachable --changed-paths
>
>     # and the $missing commit has a Bloom filter
>     ~/src/git/t/helper/test-tool bloom get_filter_for_commit "$missing"
>
> (One caveat is that if you run that script unmodified, you'll get a
> filter on both invocations of the test-tool: that's because it
> computes filters on the fly if they are missing. You can change that
> by s/1/0 in the call to get_or_compute_bloom_filter()).

I was assuming it would not be added, but it seems I was just wrong
about that.

What prompted this question from me was reviewing an MR dealing with
the format, where the code parses graph files to check whether they
have Bloom filters, and then uses --split or --split=replace
accordingly:
https://gitlab.com/gitlab-org/gitaly/-/merge_requests/3545#note_596862825

>> Reading the code there seems to be no way to do that, and we have
>> the "chunk_bloom_data" in the graph, as well as
>> "bloom_filter_settings".
>>
>> I'd expect some way to combine the "max_new_filters" and --split
>> with some eventual-consistency logic so that graphs not matching
>> our current settings are replaced, or replaced some <limit> at a
>> time.
>
> This is asking about something slightly different, Bloom filter
> settings rather than the existence of changed-path Bloom filters
> themselves. The Bloom settings aren't written to the commit-graph,
> although there has been some discussion about doing this in the past.
>
> If we ever did encode the Bloom settings, I imagine that
> accomplishing a sort of "eventually replace all changed-path Bloom
> filters with these new settings" would be as simple as considering
> all filters computed under different settings to be "uncomputed".

Yes, I don't have or know of a use-case for this now. But perhaps as
the commit-graph format gets used more as a generic container for
generated data about commits a thing like this would be useful, i.e.
regenerating that part of the graph if the settings are different,
particularly (as has been said) if it's not easy to discover the input
settings from the data itself.
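I.e. something like this rough sketch. All the names here are made up
(nothing like this exists in commit-graph.c today, and per the above
the settings aren't even stored per-filter yet); it's just to
illustrate the "different settings == uncomputed" idea:

    #include <stddef.h>
    #include <stdint.h>

    /* Mirrors the fields of "struct bloom_filter_settings" in bloom.h. */
    struct settings {
            uint32_t hash_version;
            uint32_t num_hashes;
            uint32_t bits_per_entry;
    };

    struct filter {
            unsigned char *data;
            size_t len;
            struct settings settings; /* hypothetical: not stored today */
    };

    static int settings_equal(const struct settings *a,
                              const struct settings *b)
    {
            return a->hash_version == b->hash_version &&
                   a->num_hashes == b->num_hashes &&
                   a->bits_per_entry == b->bits_per_entry;
    }

    /*
     * When writing a new layer: a filter read back from an existing
     * layer under different settings is reported as missing, so it
     * gets recomputed, and --max-new-filters then rate-limits how
     * fast the new settings roll out across the graph.
     */
    static struct filter *usable_filter(struct filter *existing,
                                        const struct settings *want)
    {
            if (!existing || !settings_equal(&existing->settings, want))
                    return NULL; /* treat as uncomputed */
            return existing;
    }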
>> Also, am I reading the expire_commit_graphs() logic correctly that
>> we first write the split graph, and then unlink() things that are
>> too old? I.e. if you rely on the commit-graph to optimize things
>> this will make things slower until the next run of writing the
>> graph?
>
> Before expire_commit_graphs() is called, we call mark_commit_graphs()
> which freshens the mtimes of all surviving commit-graph layers, and
> then expire_commit_graphs() removes the stale layers. I'm not sure
> what things getting slower is referring to, since the resulting
> commit-graph has at least as many commits as the commit-graph that
> existed prior to the write.

I was assuming we'd be left without a segment of the data, but now I
seem to recall that the commit-graph can't have missing parts of the
graph. So this was doubly a brainfart on my part.

>> I expected to find something more gentle there [...]
>
> FWIW, I also find this "expire based on mtimes" thing a little odd
> for writing split commit-graphs, because we know exactly which layers
> we want to get rid of. I suspect that the reuse comes from wanting to
> unify the logic for handling '--expire-time' with the expiration that
> happens after writing a split commit-graph that merged two or more
> previous layers.
>
> I would probably change mark_commit_graphs() to remove those merged
> layers explicitly (but still run expire_commit_graphs() to handle
> --expire-time). But, come to think of it... if merging >2 layers
> already causes the merged layers to be removed, then why would you
> ever set an --expire-time yourself?

What Derrick said downthread, but you might also find this discussion
(and upthread of that) on the split graph format interesting:
https://lore.kernel.org/git/874l63xpwx.fsf@xxxxxxxxxxxxxxxxxxx/

I.e. the initial idea was to have a numeric split
(commit-graph-{1,2,3,...}) and juggle those "chunks". It seems I
managed to convince Derrick to go with the current "chained hash"
approach.

That's related to the way we want to do expiry, i.e. you might think a
given layer isn't used anymore, but a concurrent process working over
NFS may still want it, especially if that process itself is a
long-running command.

Reading the code now I think it can be much more aggressive than it
currently is, even to the point of "commit-graph write" optionally
forking to the background and doing the unlink() itself 10-60 seconds
later or something. I.e. in load_commit_graph_chain() we always load
the full chain once we start doing so; we don't load one part of it,
find the data we want, go off to do a bunch of other work outside of
commit-graph.c, and only return to load the next segment in the chain
much later. It happens as fast as the loop around
load_commit_graph_one() will run.

That loop would be less racy if it pre-opened all the files it wants,
so the consuming process would hang onto them past an unlink() by a
concurrent process.
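I.e. something like this sketch of the pre-opening idea. These are
made-up signatures, not the actual commit-graph.c code, and it assumes
POSIX unlink() semantics, where an open fd keeps the data readable:

    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int parse_one_graph_fd(int fd); /* hypothetical stand-in for the parser */

    static int load_chain_preopened(const char **paths, int nr)
    {
            int *fds, i, ret = 0;

            fds = malloc(nr * sizeof(*fds));
            if (!fds)
                    return -1;
            for (i = 0; i < nr; i++)
                    fds[i] = -1;

            /* Phase 1: open everything; any hole means the chain is stale. */
            for (i = 0; i < nr; i++) {
                    fds[i] = open(paths[i], O_RDONLY);
                    if (fds[i] < 0) {
                            ret = -1;
                            goto done;
                    }
            }

            /*
             * Phase 2: parse. A concurrent unlink() of any of these
             * files no longer races us, since we hold open fds.
             */
            for (i = 0; i < nr && !ret; i++)
                    ret = parse_one_graph_fd(fds[i]);

    done:
            for (i = 0; i < nr; i++)
                    if (fds[i] >= 0)
                            close(fds[i]);
            free(fds);
            return ret;
    }

(On Windows the delete-while-open semantics differ, so the real thing
would presumably need the usual caveats there.)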
It's probably a non-issue in practice, but one can imagine e.g. a set
of 10x 10MB commit-graph files being loaded over a 10KB/s NFS
connection or something. Opening and parsing all the files could take
a while, and by the time we open() the last one an aggressive expiry
policy might have removed it from under us.

Anyway, even then the worst we'll do is warn about not finding the
segment in the chain and return the graph up until that point, i.e.:

    warning(_("unable to find all commit-graph files"));

So yeah, maybe we can just unlink() them right away. Another way to
handle the race is that load_commit_graph_chain() could just try again
from the beginning in such a case, presumably picking up the fresh
just-rewritten chain.
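Which, as a sketch with made-up names (load_chain_once() standing in
for the existing "read commit-graph-chain, then load each layer"
logic), might look something like:

    #include <stddef.h>

    struct commit_graph; /* opaque here */

    /*
     * Hypothetical: one pass of today's chain-loading logic, setting
     * *incomplete if a layer vanished mid-load (the "unable to find
     * all commit-graph files" case above).
     */
    struct commit_graph *load_chain_once(int *incomplete);

    struct commit_graph *load_chain_with_retry(void)
    {
            struct commit_graph *g = NULL;
            int tries;

            for (tries = 0; tries < 3; tries++) {
                    int incomplete = 0;

                    g = load_chain_once(&incomplete);
                    if (!incomplete)
                            break;
                    /*
                     * A layer went missing under us: most likely a
                     * concurrent "commit-graph write" just rewrote the
                     * chain, so start over and pick up the fresh
                     * commit-graph-chain file.
                     */
            }
            /* After the retries, settle for whatever we got last. */
            return g;
    }

Bounding the retries keeps a pathological writer from starving the
reader; once they're exhausted we'd degrade to today's behavior of
warning and using the partial chain.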