Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two

Taylor Blau <me@xxxxxxxxxxxx> · Sat, 5 Sep 2020 13:38:25 -0400

On Sat, Sep 05, 2020 at 07:22:08PM +0200, Jakub Narębski wrote:
> Taylor Blau <me@xxxxxxxxxxxx> writes:
> How do you distinguish between no changed paths stored because there
> were no changes (which should not count as *_found_large), and no
> changed paths stored because there were too many changes?  If I remember
> it correctly in current implemetation both are represented as
> zero-length filter (no changed paths could have been represented as all
> zeros filter, too many changed paths could have been represented as all
> ones filter).

Right, once we get handed back a filter from
'get_or_compute_bloom_filter()', we can't distinguish between (a) a
commit with too many changes to store in a single Bloom filter, and (b)
a commit with no changes at all.

It's unfortunate that callers can't pick between the two, but this
implementation is actually an improvement on the status-quo! Why?
Because right now we'll see an "empty" Bloom filter and recompute it
because it's "missing", only to discover that it has no changes.

With this patch, we'll say "this filter looks too large", and stop
computing it, because we have already gone through the effort to compute
it once (and marked it in the BFXL chunk).

Now, you could certainly argue that 'BFXL' could be called 'BFWC'
("Bloom filter was computed"), and then the "was this filter too large"
check means that the commit was (a) marked in a BFWC chunk, and (b) has
length zero. I'm not necessarily opposed, but I think that this is
probably not worth it, since we're trying to disambiguate something that
is inherently ambiguous. (That is, even with this new check, a
length-zero would still be thought to be "too large", since it was
computed, and has length 0).

> No changes to store in filter can happen not only with `--allow-empty`
> (e.g. via interactive rebase), but also with merge where all changes
> came from the second parent -- we are storing only changes to first
> parent, if I remember it correctly.

Agreed. And yes, Bloom filters store changes only to their commit's
first parent.

> This is a minor issue, though.

Thanks for raising it. I don't think that this is a show-stopper for
this series.

Thanks,
Taylor