Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Thu, 27 Jul 2023 13:53:08 -0700

Taylor Blau <me@xxxxxxxxxxxx> writes:
> > The intention in the current patch set was to not load it at all when we
> > have incompatible Bloom settings because it appeared quite troublesome
> > to notate which Bloom filter in memory is of which version. If we want
> > to copy forward existing results, we can change that, but I don't know
> > whether it's worth doing that (and if we think it's worth doing, this
> > should probably go in another patch set).
> 
> Yeah, I think having Bloom filters accessible from a commit-graph
> regardless of whether or not it matches our Bloom filter version is
> prerequisite to being able to implement something like this.
> 
> I feel like this is important enough to do in the same patch set, or the
> same release to avoid surprising operators when their commit-graph write
> suddenly recomputes all of its Bloom filters.

Suddenly reading many (or most) of the repo's trees would be a similar
surprise, right?

Also this would happen only if the server operator explicitly sets a
changed path filter version. If they leave it as-is, commit graphs will
still be written with the same version as the one on disk.

> Since we already store the Bloom version that we're using in the
> commit-graph file itself, shouldn't it be something along the lines of
> sticking that value onto the bloom_filter when we read its contents?
> 
> Although I don't think that you'd even need to annotate each individual
> filter, since you know that every pre-existing Bloom filter you are able
> to find has the version given by:
> 
>     the_repository->objects->commit_graph->bloom_filter_settings->hash_version
> 
> right?
> 
> Thanks,
> Taylor

Regarding consulting commit_graph->bloom_filter_settings->hash_version,
the issue I ran into was that firstly, I didn't know what to do about
commit_graph->base_graph (which also has its own bloom_filter_settings)
and what to do if it had a contradictory hash_version. And even if
we found a way to unify those, it is not true that every Bloom filter
in memory is of that version, since we may have generated some that
correspond to the version we're writing (not the version on disk).
In particular, the Bloom filters we write come from a commit slab
(bloom_filters in bloom.c) and in that slab, both Bloom filters from
disk and Bloom filters that were generated in-process coexist.

I also thought of your other proposal of augmenting struct bloom_filter
to also include the version. The issue I ran into there is if, for a
given commit, there already exists a Bloom filter read from disk with
the wrong version, what should we do? Overwrite it, or store both
versions in memory? (We can't just immediately output the Bloom filter
to disk and forget about the new version, only storing its size so that
we can generate the BIDX, because in the current code, generation and
writing to disk are separate. We could try to refactor it, but I didn't
want to make such a large change to reduce the possibility of bugs.)
Both storing the version number and storing an additional pointer for a
second version would increase memory consumption too, even when
supporting two versions isn't needed, but maybe this isn't a big deal.