Derrick Stolee <derrickstolee@xxxxxxxxxx> writes:

> I appreciate that you discovered and are presenting a way out of this
> problem, however the current approach does not preserve compatibility
> enough.
> ...
> By changing this algorithm directly (instead of making an "unsigned" version,
> or renaming this one to the "maybe signed" version), you are making it
> impossible for us to ship a version that can read version 1 Bloom filters,
> so all read-only history operations will immediately slow down (because they
> will ignore v1 chunks, better than incorrectly parsing v1 chunks).
>
> Here's where we would ignore v1 filters, instead of continuing to read them
> (with all the risks involved).

I do not follow the "all the risks involved" comment.  Is the risk
something we can mitigate by still reading v1 data but being careful
about when not to apply the filters?

I may be misremembering the original discussion, but wasn't the
conclusion that v1 data is salvageable, in the sense that it can still
reliably say that a given pathname with no high-bit-set bytes cannot
possibly belong to the set of changed paths, even though, because the
filter is contaminated with "signed" data, its false-positive rate may
be higher than with the "unsigned" version?  Based on that
observation, we can still read v1 data but only use the Bloom filters
when the queried paths have no byte with the high bit set.

Also, if we are operating in such an environment, would it be possible
to first compute as if we were going to generate v2 data, but write it
as v1 after reading all the paths and realizing there are no
problematic paths?  IOW, even if the version of Git is capable of
writing and reading v2, it does not have to use v2, right?

To put it the other way around, we will have to and can keep
supporting data that is labeled as v1, no?

> In order for this to be something we can ship safely to environments that depend
> on changed-path Bloom filters, we need to be able to parse v1 filters. It would
> be even better if we didn't write v2 filters by default, but instead hid it
> behind a config option that is off by default for at least one major release.

Is the concern that we will double the chunk size because both v1 and
v2 will be written?  Or is it that we will stop writing v1 once we
start writing v2, and switching too early will make repositories
slower for older implementations that haven't died out?

Thanks.
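
P.S. To make the "still read v1, but be careful when to apply it" idea concrete, here is a rough sketch in C.  None of these names exist in the codebase; path_has_high_bit(), filter_usable_for_path() and the two widen_*() helpers are made up for illustration.  It also assumes that the only difference between the two versions is whether a path byte is sign-extended or zero-extended before it is mixed into the hash.

```c
#include <stdint.h>

/* Hypothetical helper: nonzero if any byte of 'path' has its high bit set. */
static int path_has_high_bit(const char *path)
{
	for (; *path; path++)
		if ((unsigned char)*path & 0x80)
			return 1;
	return 0;
}

/*
 * Hypothetical policy: a v2 filter can be consulted for any path; a v1
 * filter is consulted only when the queried path has no high-bit bytes,
 * because only then do the "signed" and "unsigned" hashes agree.
 * Returning 0 means "skip the filter and fall back to the slow path".
 */
static int filter_usable_for_path(int hash_version, const char *path)
{
	if (hash_version == 2)
		return 1;
	if (hash_version == 1)
		return !path_has_high_bit(path);
	return 0; /* unknown version: ignore the filter */
}

/*
 * The assumed source of the mismatch: how one input byte is widened
 * to 32 bits before being mixed into the hash.
 */
static uint32_t widen_signed(char c)
{
	return (uint32_t)(signed char)c;
}

static uint32_t widen_unsigned(char c)
{
	return (uint32_t)(unsigned char)c;
}
```

For a byte below 0x80 the two widen_*() helpers return the same value, so a pure-ASCII query hashes identically either way.  For a byte such as 0xE9, the signed variant sign-extends and the results differ, which is exactly the case the policy function refuses to answer from a v1 filter.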