Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Tue, 1 Aug 2023 11:52:32 -0700

Taylor Blau <me@xxxxxxxxxxxx> writes:
> On Thu, Jul 27, 2023 at 01:53:08PM -0700, Jonathan Tan wrote:
> > Suddenly reading many (or most) of the repo's trees would be a similar
> > surprise, right?
> 
> That's a good point. I think in general I'd expect Git to avoid
> recomputing Bloom filters where that work can be avoided, if the work in
> order to detect whether or not we need to recompute a filter is cheap
> enough to carry out.

Makes sense. I just don't think that there is a cheap way to detect if
a filter does not need to be recomputed (the closest way I think we have
is something that will require reading a lot of trees in a repo).

> > Also this would happen only if the server operator explicitly sets a
> > changed path filter version. If they leave it as-is, commit graphs will
> > still be written with the same version as the one on disk.
> 
> I think that I could live with that if the default is to leave things
> as-is.

Ah, thanks.

> I still think that it's worth it to have this functionality to propagate
> Bloom filters forward should ship in Git, but we can work on that
> outside of this series.

Makes sense.

> > Regarding consulting commit_graph->bloom_filter_settings->hash_version,
> > the issue I ran into was that firstly, I didn't know what to do about
> > commit_graph->base_graph (which also has its own bloom_filter_settings)
> > and what to do if it had a contradictory hash_version. And even if
> > we found a way to unify those, it is not true that every Bloom filter
> > in memory is of that version, since we may have generated some that
> > correspond to the version we're writing (not the version on disk).
> > In particular, the Bloom filters we write come from a commit slab
> > (bloom_filters in bloom.c) and in that slab, both Bloom filters from
> > disk and Bloom filters that were generated in-process coexist.
> 
> Would we ever want to have a commit-graph chain with mixed Bloom filter
> versions?

Probably not, but I wanted to be robust in case a third-party tool wrote a chain with
mixed versions.

> We avoid mixing generation number schemes across multiple layers of a
> commit-graph chain. But I don't see any reason that we should or need to
> have a similar restriction in place for the Bloom filter version. Both
> are readable, as long as the user-provided configuration allows them to
> be.

Yes, that's true - there is no inherent reason why we can't mix them,
unlike with generation numbers.

> We just have to interpret them differently depending on what layer of
> the graph (and therefore, what Bloom filter version they are) they come
> from.
> 
> Sorry for thinking aloud a little there. I think that this means that we
> at minimum have to keep in context the commit-graph layer we found the
> Bloom filter in so that we can tie that back to its Bloom filter
> version. That might just mean that we have to tag each Bloom filter with
> the version it was computed under, or maybe we already have the
> commit-graph layer in context, in which case we shouldn't need an
> additional field.
> 
> My gut is telling me that we probably *do* need such a field, since we
> don't often have a reference to the particular layer that we found a
> Bloom filter in, just the tip of the commit-graph chain that it came
> from.

We'll need the additional field because we don't know which commit graph
layer it comes from. In fact, we don't even know which *repo* the commit
comes from, since the commit slab is global. (Moving the slab to being
under a repo or under a commit graph layer would fix this.)

But I think there still remains the question of whether we really need
to support multiple versions in one Git invocation.

> > I also thought of your other proposal of augmenting struct bloom_filter
> > to also include the version. The issue I ran into there is if, for a
> > given commit, there already exists a Bloom filter read from disk with
> > the wrong version, what should we do? Overwrite it, or store both
> > versions in memory? (We can't just immediately output the Bloom filter
> > to disk and forget about the new version, only storing its size so that
> > we can generate the BIDX, because in the current code, generation and
> > writing to disk are separate. We could try to refactor it, but I didn't
> > want to make such a large change to reduce the possibility of bugs.)
> > Both storing the version number and storing an additional pointer for a
> > second version would increase memory consumption too, even when
> > supporting two versions isn't needed, but maybe this isn't a big deal.
> 
> It's likely that I'm missing something here, but what is stopping us
> from discarding the old Bloom filter as soon as we generate the new
> one? We shouldn't need to load the old filter again out of the commit
> slab, right?
> 
> Thanks,
> Taylor

I did not look at the code closely enough except to see that there was
a gap between generating the new-version Bloom filters and writing them
to disk, and I was concerned that now, or in the future, there would
be some code in that gap that reads the Bloom filter for a commit and
expects the old-version Bloom filter there.