Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two

Taylor Blau <me@xxxxxxxxxxxx> · Sat, 5 Sep 2020 14:55:34 -0400

On Sat, Sep 05, 2020 at 02:38:54PM -0400, Taylor Blau wrote:
> I don't know. I think my biggest objection is the size: we use the BIDX
> chunk today to avoid having to write the length-zero Bloom filters; your
> scheme would force us to write every filter. On the other hand, we could
> continue to avoid writing length-zero filters, so long as the
> commit-graph indicates that it knows this optimization.

Thinking about it a little bit more, I'm pretty sure that this isn't as
easy as it sounds. Say that we:

  - continue to encode length-zero Bloom filters as equal adjacent
    entries in the BIDX, but reserve the length-zero filter for commits
    with no changed-paths, _or_ commits whose Bloom filters have not yet
    been computed

  - write "too large" Bloom filters (i.e., commits with >= 512 changed
    paths in a diff to their first parent) as a non-empty Bloom filter
    with all bits set high.

I think we'd be no better off than we are today, because the length-zero
Bloom filter is still overloaded. Since we would treat empty filters the
same as ones that haven't been computed, we would recompute the empty
ones, and each recomputation would count against our
'--max-new-filters' budget.
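
To make the overloading concrete, here is a rough sketch of the scheme
above. It is a toy model in Python, not Git's actual code: the names
(classify, filters_to_recompute, BIDX, BDAT as plain Python values) are
made up for illustration, and it assumes BIDX holds cumulative end
offsets into the filter data, so equal adjacent entries mean a
length-zero filter.

```python
# Toy model of the hypothetical scheme; names and layout are assumptions.
# BIDX[i] is the cumulative end offset of commit i's filter in BDAT.

def classify(bidx, bdat, i):
    """Return what a reader could conclude about commit i's filter."""
    start = bidx[i - 1] if i > 0 else 0
    data = bdat[start:bidx[i]]
    if not data:
        # Length zero: either "no changed paths" or "never computed".
        return "empty-or-not-computed"
    if all(byte == 0xFF for byte in data):
        # Non-empty with all bits set: the "too large" marker.
        return "too-large"
    return "computed"

def filters_to_recompute(bidx, bdat, max_new_filters):
    """Commits a writer would (re)compute, capped by the budget."""
    todo = [i for i in range(len(bidx))
            if classify(bidx, bdat, i) == "empty-or-not-computed"]
    return todo[:max_new_filters]

# Commit 0: ordinary 2-byte filter.
# Commit 1: computed, but has no changed paths (length zero).
# Commit 2: "too large", stored as one byte with all bits set.
# Commit 3: not yet computed (also length zero).
BDAT = bytes([0x12, 0x34, 0xFF])
BIDX = [2, 2, 3, 3]
```

With a budget of one, filters_to_recompute(BIDX, BDAT, 1) picks commit
1, which was already computed, and never gets to commit 3. That is the
wasted '--max-new-filters' budget described above.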

I don't see a non-convoluted way to split the overloaded length-zero
case into two distinguishable cases without a format extension. To be
clear, I think that your idea is good, and that it would be workable if
not for the existing structure of the BIDX chunk (which itself made
sense at the time that it was written).

So, I really want your idea to work. But, I think that ultimately the
BFXL chunk is a more straightforward path forward.

Thanks,
Taylor