Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 24 May 2023 08:51:52 +0900

Derrick Stolee <derrickstolee@xxxxxxxxxx> writes:

> I appreciate that you discovered and are presenting a way out of this
> problem, however the current approach does not preserve compatibility
> enough.
> ...
> By changing this algorithm directly (instead of making an "unsigned" version,
> or renaming this one to the "maybe signed" version), you are making it
> impossible for us to ship a version that can read version 1 Bloom filters,
> so all read-only history operations will immediately slow down (because they
> will ignore v1 chunks, better than incorrectly parsing v1 chunks).
>
> Here's where we would ignore v1 filters, instead of continuing to read them
> (with all the risks involved).

I do not understand the "all the risks involved" comment.  Is the
risk something we can mitigate by still reading v1 data but being
careful about when not to apply the filters?

I may be misremembering the original discussion, but wasn't the
conclusion that v1 data is salvageable in the sense that it can
still reliably say that, given a pathname without bytes with
high-bit set, it cannot possibly belong to the set of changed paths,
even though, because the filter is contaminated with "signed" data,
its false-positive rate may be higher than using "unsigned" version?
And based on that observation, we can still read v1 data but only
use the Bloom filters when the queried paths have no byte with
high-bit set.
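
To illustrate what I mean, here is a rough sketch.  It is not taken
from the patch; the helper names are made up for this message, and
the "widen" functions only model how a byte is promoted before it is
mixed into the hash when "char" is signed vs unsigned.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if any byte in path has its high bit set. */
static int path_has_high_bit(const char *path)
{
	const unsigned char *p = (const unsigned char *)path;

	for (; *p; p++)
		if (*p & 0x80)
			return 1;
	return 0;
}

/* Model of the promotion on a platform where "char" is signed. */
static uint32_t widen_signed(char c)
{
	return (uint32_t)(int32_t)(signed char)c;
}

/* Model of the promotion the fixed ("unsigned") hash would use. */
static uint32_t widen_unsigned(char c)
{
	return (uint32_t)(unsigned char)c;
}

/*
 * Hypothetical policy: consult a v1 filter only for a query path
 * made purely of 7-bit bytes, where both promotions agree.
 */
static int can_use_v1_filter(const char *path)
{
	return !path_has_high_bit(path);
}
```

For a byte below 0x80 the two promotions give the same value, so the
query hashes the same way either way; for anything else the caller
would skip the filter and fall back to the tree diff.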

Also, if we are operating in an environment where no path has such a
byte, would it be possible to first compute as if we were going to
generate v2 data, but write it as v1 after reading all the paths and
realizing there are no problematic paths?  IOW, even if the version
of Git is capable of writing and reading v2, it does not have to use
v2, right?  To put it the other way around, we will have to and can
keep supporting data that is labeled as v1, no?
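
Something along these lines is what I have in mind for the writing
side.  Again, this is only a sketch with made-up names (FILTER_V1,
FILTER_V2, pick_filter_version), and it assumes the writer has seen
every path that goes into the filters before it decides on the
label.

```c
#include <stddef.h>

/* Hypothetical labels for the on-disk filter version. */
enum { FILTER_V1 = 1, FILTER_V2 = 2 };

/* Nonzero if any byte in s has its high bit set. */
static int has_high_bit(const char *s)
{
	for (; *s; s++)
		if ((unsigned char)*s & 0x80)
			return 1;
	return 0;
}

/*
 * The filters themselves would always be computed with the fixed
 * ("unsigned") hash.  If no path contains a high-bit byte, that
 * computation matches what a v1 writer would have produced, so the
 * data can still be labeled v1; otherwise it has to be labeled v2.
 */
static int pick_filter_version(const char *const *paths, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (has_high_bit(paths[i]))
			return FILTER_V2;
	return FILTER_V1;
}
```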

> In order for this to be something we can ship safely to environments that depend
> on changed-path Bloom filters, we need to be able to parse v1 filters. It would
> be even better if we didn't write v2 filters by default, but instead hid it
> behind a config option that is off by default for at least one major release.

Is the concern that we will double the chunk size because both v1
and v2 will be written?  Or is it that we will stop writing v1 if we
start writing v2 and switching too early will mean the repositories
will become slower for older implementations that haven't died out?

Thanks.