Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Taylor Blau <me@xxxxxxxxxxxx> · Tue, 1 Aug 2023 14:08:50 -0400

On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
> > > The intention in the current patch set was to not load it at all when we
> > > have incompatible Bloom settings because it appeared quite troublesome
> > > to notate which Bloom filter in memory is of which version. If we want
> > > to copy forward existing results, we can change that, but I don't know
> > > whether it's worth doing that (and if we think it's worth doing, this
> > > should probably go in another patch set).
> >
> > Yeah, I think having Bloom filters accessible from a commit-graph
> > regardless of whether or not it matches our Bloom filter version is
> > prerequisite to being able to implement something like this.
> >
> > I feel like this is important enough to do in the same patch set, or the
> > same release to avoid surprising operators when their commit-graph write
> > suddenly recomputes all of its Bloom filters.
>
> Suddenly reading many (or most) of the repo's trees would be a similar
> surprise, right?

That's a good point. I think in general I'd expect Git to avoid
recomputing Bloom filters where that work can be avoided, if the work in
order to detect whether or not we need to recompute a filter is cheap
enough to carry out.

> Also this would happen only if the server operator explicitly sets a
> changed path filter version. If they leave it as-is, commit graphs will
> still be written with the same version as the one on disk.

I think that I could live with that if the default is to leave things
as-is.

I still think that it's worth it to have this functionality to propagate
Bloom filters forward should ship in Git, but we can work on that
outside of this series.

> Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> the issue I ran into was that firstly, I didn't know what to do about
> commit_graph->base_graph (which also has its own bloom_filter_settings)
> and what to do if it had a contradictory hash_version. And even if
> we found a way to unify those, it is not true that every Bloom filter
> in memory is of that version, since we may have generated some that
> correspond to the version we're writing (not the version on disk).
> In particular, the Bloom filters we write come from a commit slab
> (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> disk and Bloom filters that were generated in-process coexist.

Would we ever want to have a commit-graph chain with mixed Bloom filter
versions?

We avoid mixing generation number schemes across multiple layers of a
commit-graph chain. But I don't see any reason that we should or need to
have a similar restriction in place for the Bloom filter version. Both
are readable, as long as the user-provided configuration allows them to
be.

We just have to interpret them differently depending on what layer of
the graph (and therefore, what Bloom filter version they are) they come
from.

Sorry for thinking aloud a little there. I think that this means that we
at minimum have to keep in context the commit-graph layer we found the
Bloom filter in so that we can tie that back to its Bloom filter
version. That might just mean that we have to tag each Bloom filter with
the version it was computed under, or maybe we already have the
commit-graph layer in context, in which case we shouldn't need an
additional field.

My gut is telling me that we probably *do* need such a field, since we
don't often have a reference to the particular layer that we found a
Bloom filter in, just the tip of the commit-graph chain that it came
from.

> I also thought of your other proposal of augmenting struct bloom_filter
> to also include the version. The issue I ran into there is if, for a
> given commit, there already exists a Bloom filter read from disk with
> the wrong version, what should we do? Overwrite it, or store both
> versions in memory? (We can't just immediately output the Bloom filter
> to disk and forget about the new version, only storing its size so that
> we can generate the BIDX, because in the current code, generation and
> writing to disk are separate. We could try to refactor it, but I didn't
> want to make such a large change to reduce the possibility of bugs.)
> Both storing the version number and storing an additional pointer for a
> second version would increase memory consumption too, even when
> supporting two versions isn't needed, but maybe this isn't a big deal.

It's likely that I'm missing something here, but what is stopping us
from discarding the old Bloom filter as soon as we generate the new
one? We shouldn't need to load the old filter again out of the commit
slab, right?

Thanks,
Taylor