Re: [PATCH 1/3] revision: complicated pathspecs disable filters

Taylor Blau <me@xxxxxxxxxxxx> · Wed, 15 Apr 2020 18:52:41 -0600

On Thu, Apr 16, 2020 at 12:18:33AM +0200, Jakub Narębski wrote:
> On Wed, 15 Apr 2020 at 20:37, Derrick Stolee <stolee@xxxxxxxxx> wrote:
> [...]
> > -->8--
> > From 89beb9598daabb19e3c896bbceeb0fc1b9ccc6ca Mon Sep 17 00:00:00 2001
> > From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> > Date: Wed, 15 Apr 2020 18:04:25 +0000
> > Subject: [PATCH] bloom: compute all Bloom hashes from lowercase
> >
> > The changed-path Bloom filters currently hash path strings using
> > the exact string for the path. This makes it difficult* to use the
> > filters when restricting to case-insensitive pathspecs.
> >
> > * I say "difficult" because it is possible to generate all 2^n
> >   options for the case of a path and test them all, but this is
> >   a bad idea and should not be done. "Impossible" is an appropriate
> >   alternative.
> >
> > THIS IS A BREAKING CHANGE. Commit-graph files with changed-path
> > Bloom filters computed by a previous commit will not be compatible
> > with the filters computed in this commit, nor will we get correct
> > results when testing across these incompatible versions. Normally,
> > this would be a completely unacceptable change, but the filters
> > have not been released and hence are still possible to update
> > before release.
> >
> > TODO: If we decide to move in this direction, then the following
> > steps should be done (and some of them should be done anyway):
> >
> > * We need to document the Bloom filter format to specify exactly
> >   how we compute the filter data. The details should be careful
> >   enough that someone can reproduce the exact file format without
> >   looking at the C code.
> >
> > * That document would include the tolower() transformation that is
> >   being done here.
>
> Why not modify the BDAT chunk to include version of
> case folding transformation or other collation algorithm
> (other transformation).that is done prior to computing
> the Bloom filter key? Though that might be unnecessary
> flexibility...

If this ends up being something that we want to do, I agree with
Stolee's reasoning that this should be a breaking change. If we were,
say, several months into having Bloom filters in a release and decided
at that point to make the change, then: sure, supporting both by writing
a bit in the BDAT chunk makes sense.

But, we're many months away from that state yet, and so I don't think
the cost of rebuilding what few commit-graphs exist with bloom filters
in them today to support both ordinary and lower-cased paths in the
filter.

Anyway, I'm still not sold on this idea in general (nor do I understand
it that others are), so I'll respond in more detail in another part of
the thread...

> For example the value of 0x00 in such field of BDAT
> chunk header would mean no transformation, while
> the value of 0x01 would mean per-character tolower()
> or Unicode equivalent of it.
>
> Best,
> --
> Jakub Narębski

Thanks,
Taylor