Junio C Hamano <gitster@xxxxxxxxx> writes:

> I may be misremembering the original discussion, but wasn't the
> conclusion that v1 data is salvageable in the sense that it can
> still reliably say that, given a pathname without bytes with
> high-bit set, it cannot possibly belong to the set of changed paths,
> even though, because the filter is contaminated with "signed" data,
> its false-positive rate may be higher than using "unsigned" version?
> And based on that observation, we can still read v1 data but only
> use the Bloom filters when the queried paths have no byte with
> high-bit set.

There are at least 3 ways of salvaging the data that we've discussed:

 - Enumerating all of a repo's paths and, if none of them have a high
   bit, retaining the existing filters.

 - Walking all of a repo's trees (so that we know which tree
   corresponds to which commit) and if, for a commit, all its trees
   have no high bit, retaining the filter for that commit (otherwise
   recomputing it).

 - Keeping a version 1 filter in use, but only when the sought-for
   path has no high bit (as you describe here).

(The first 2 are my interpretation of what Taylor described [1].)

I'm not sure if we want to keep version 1 filters around at all. They
would work with Git as long as it is not compiled with a different
signedness of char, but would not work with other implementations of
Git (unless they replicate the hashing bug).

There is also the issue of how we're going to indicate in a commit
graph file that some filters are version 1 and some filters are
version 2 (unless the plan is to completely rewrite the filters in
this case, but then we'll run into the issue that computing these
filters en masse is expensive, as Taylor also describes in [1]).

> Also if we are operating in such an environment then would it be
> possible to first compute as if we were going to generate v2 data,
> but write it as v1 after reading all the path and realizing there
> are no problematic paths?

I think in this case, we would want to write it as v2 anyway, because
there's no way to distinguish a v1 that has high bits and is written
incorrectly versus a v1 that happens to have no high bits and
therefore is identical under v2.

> IOW, even if the version of Git is
> capable of writing and reading v2, it does not have to use v2,
> right? To put it the other way around, we will have to and can keep
> supporting data that is labeled as v1, no?

I think this is the main point - whether we want to continue
supporting data labeled as v1. I personally think that we should
migrate to v2 as quickly as possible. But if the consensus is that we
should support both, at least for a few releases of Git, I'll go with
that.

[1] https://lore.kernel.org/git/ZF116EDcmAy7XEbC@nand.local/
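
To make the "identical under v2 when there are no high bits" point concrete, here is a rough sketch in Python rather than C. It models a murmur3-style 32-bit hash in which each byte is widened either as a signed or as an unsigned char. The names murmur3_sketch, widen and has_high_bit are made up for this illustration, the seed of 0 is arbitrary, and the code is not claimed to be byte-for-byte what Git's Bloom filter code computes; it only shows where sign extension enters and why paths without a high bit are unaffected.

```python
MASK = 0xFFFFFFFF

def rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK

def widen(byte, signed_char):
    """Model C's (uint32_t)data[i]: sign-extends if char is signed."""
    if signed_char and byte >= 0x80:
        return (byte - 0x100) & MASK  # e.g. 0xC3 becomes 0xFFFFFFC3
    return byte

def has_high_bit(path):
    """True if any byte of the path has its top bit set."""
    return any(b >= 0x80 for b in path)

def murmur3_sketch(seed, data, signed_char):
    """Murmur3-style 32-bit hash (illustrative, hypothetical helper)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK
    n4 = len(data) // 4
    # Body: consume the input four bytes at a time.
    for i in range(n4):
        k = 0
        for j in range(4):
            k |= (widen(data[4 * i + j], signed_char) << (8 * j)) & MASK
        k = rotl32((k * c1) & MASK, 15)
        k = (k * c2) & MASK
        h = rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK
    # Tail: the remaining 0 to 3 bytes.
    tail = data[4 * n4:]
    if tail:
        k = 0
        for j, b in enumerate(tail):
            k ^= (widen(b, signed_char) << (8 * j)) & MASK
        k = rotl32((k * c1) & MASK, 15)
        h ^= (k * c2) & MASK
    # Finalization mix.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK
    h ^= h >> 16
    return h

ascii_path = b"dir/file.txt"
utf8_path = "\u00e9.txt".encode("utf-8")  # starts with bytes 0xC3 0xA9

for path in (ascii_path, utf8_path):
    signed = murmur3_sketch(0, path, signed_char=True)
    unsigned = murmur3_sketch(0, path, signed_char=False)
    print(path, has_high_bit(path), signed == unsigned)
```

Under these assumptions, the ASCII path hashes the same either way, while the path with a high bit does not. That is why a v1 filter could only be trusted for sought-for paths where has_high_bit() is false, and also why a v1 file with no problematic paths cannot be told apart from a v2 one by its contents alone.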