Re: [PATCH 2/2] commit-graph: fix murmur3, bump filter ver. to 2

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Wed, 24 May 2023 14:26:57 -0700

Junio C Hamano <gitster@xxxxxxxxx> writes:
> I may be misremembering the original discussion, but wasn't the
> conclusion that v1 data is salvageable in the sense that it can
> still reliably say that, given a pathname without bytes with
> high-bit set, it cannot possibly belong to the set of changed paths,
> even though, because the filter is contaminated with "signed" data,
> its false-positive rate may be higher than using "unsigned" version?
> And based on that observation, we can still read v1 data but only
> use the Bloom filters when the queried paths have no byte with
> high-bit set.

There are at least 3 ways of salvaging the data that we've discussed:

- Enumerate all of a repo's paths and, if none of them has a high bit
  set, retain the existing filters.
- Walk all of a repo's trees (so that we know which tree corresponds
  to which commit) and, if no path in a commit's trees has a high bit
  set, retain the filter for that commit (otherwise recompute it).
- Keep using a version 1 filter, but only when the sought-for path has
  no high bit set (as you describe here).

(The first 2 are my interpretation of what Taylor described [1].)
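
As a rough sketch of the third option only: the helper names
path_has_high_bit() and can_use_filter() below are hypothetical and do
not exist in Git, and the integer "filter_version" stands in for
however the version would actually be recorded. The idea is that a
version 1 filter is consulted only for paths made up entirely of 7-bit
bytes, while a version 2 filter is consulted for any path.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a Git function): returns nonzero if any
 * byte of the path has its high bit (0x80) set.
 */
static int path_has_high_bit(const char *path, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if ((unsigned char)path[i] & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical policy: a version 2 filter may be used for any path; a
 * version 1 filter may be used only when the queried path has no
 * high-bit bytes; any other version is ignored.
 */
static int can_use_filter(int filter_version, const char *path, size_t len)
{
	if (filter_version == 2)
		return 1;
	if (filter_version == 1)
		return !path_has_high_bit(path, len);
	return 0;
}
```

When can_use_filter() returns 0, the caller would have to fall back to
diffing trees for that commit instead of trusting the filter.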

I'm not sure we want to keep version 1 filters around at all: they
would work with Git as long as it is not compiled with a different
signedness of char, but would not work with other implementations of
Git (unless they replicate the hashing bug). There is also the issue of
how to indicate in a commit-graph file that some filters are version 1
and some are version 2 (unless the plan is to rewrite all of the
filters in this case, but then we run into the issue that computing
these filters en masse is expensive, as Taylor also describes in [1]).

> Also if we are operating in such an environment then would it be
> possible to first compute as if we were going to generate v2 data,
> but write it as v1 after reading all the path and realizing there
> are no problematic paths?  

I think in this case we would want to write it as v2 anyway, because
there's no way to distinguish a v1 filter computed over paths with
high bits (and therefore written incorrectly) from a v1 filter whose
paths happen to have no high bits (and which is therefore identical
under v2).
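
To illustrate why the two cases coincide for 7-bit paths, here is a
simplified 32-bit murmur3 sketch. It is not Git's actual code:
murmur3_sketch(), widen() and the "sign_extend" flag are made up for
this example, with sign_extend=1 mimicking bytes read through a signed
char (the v1 behavior on such platforms) and sign_extend=0 mimicking
unsigned bytes (v2).

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int r)
{
	return (x << r) | (x >> (32 - r));
}

/* Widen one byte to 32 bits, optionally sign-extending it. */
static uint32_t widen(const char *data, size_t i, int sign_extend)
{
	uint32_t b = (unsigned char)data[i];

	if (sign_extend && (b & 0x80))
		b |= 0xffffff00; /* what a signed char would widen to */
	return b;
}

/* Simplified murmur3 (32-bit); an illustration, not Git's implementation. */
static uint32_t murmur3_sketch(uint32_t seed, const char *data, size_t len,
			       int sign_extend)
{
	const uint32_t c1 = 0xcc9e2d51, c2 = 0x1b873593;
	const char *tail = data + (len / 4) * 4;
	uint32_t h = seed, k;
	size_t i;

	/* Body: mix in one little-endian 4-byte block at a time. */
	for (i = 0; i < len / 4; i++) {
		k = widen(data, 4 * i, sign_extend) |
		    (widen(data, 4 * i + 1, sign_extend) << 8) |
		    (widen(data, 4 * i + 2, sign_extend) << 16) |
		    (widen(data, 4 * i + 3, sign_extend) << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		h = rotl32(h, 13);
		h = h * 5 + 0xe6546b64;
	}

	/* Tail: the remaining 0 to 3 bytes. */
	k = 0;
	switch (len & 3) {
	case 3:
		k ^= widen(tail, 2, sign_extend) << 16;
		/* fallthrough */
	case 2:
		k ^= widen(tail, 1, sign_extend) << 8;
		/* fallthrough */
	case 1:
		k ^= widen(tail, 0, sign_extend);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
	}

	/* Finalization mix. */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```

For a path such as "src/main.c" both settings of sign_extend produce
the same hash, since no byte is ever sign-extended. For a path such as
"\xc3\xa9.txt" the first block differs, and so does the result.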

> IOW, even if the version of Git is
> capable of writing and reading v2, it does not have to use v2,
> right?  To put it the other way around, we will have to and can keep
> supporting data that is labeled as v1, no?

I think this is the main point - whether we want to continue supporting
data labeled as v1. I personally think that we should migrate to v2
as quickly as possible. But if the consensus is that we should support
both, at least for a few releases of Git, I'll go with that.

[1] https://lore.kernel.org/git/ZF116EDcmAy7XEbC@nand.local/