Re: [PATCH v4 07/14] bloom: split 'get_bloom_filter()' in two

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Sat, 5 Sep 2020 21:04:50 +0200

On Sat, Sep 05, 2020 at 02:55:34PM -0400, Taylor Blau wrote:
> On Sat, Sep 05, 2020 at 02:38:54PM -0400, Taylor Blau wrote:
> > I don't know. I think my biggest objection is the size: we use the BIDX
> > chunk today to avoid having to write the length-zero Bloom filters; your
> > scheme would force us to write every filter. On the other hand, we could
> > continue to avoid writing length-zero filters, so long as the
> > commit-graph indicates that it knows this optimization.
> 
> Thinking about it a little bit more, I'm pretty sure that this isn't as
> easy as it sounds. Say that we:
> 
>   - continued to encode length-zero Bloom filters as equal adjacent
>     entries in the BIDX, but reserve the length-zero filter for commits
>     with no changed-paths, _or_ commits whose Bloom filters have not yet
>     been computed

No, use zero-length filters for commits whose Bloom filters have not
yet been computed, and use a one-byte, all-zero-bits Bloom filter for
commits with no modified paths.

And this is exactly what I proposed earlier.
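
To make the encoding concrete, here is a minimal sketch of how a reader
could tell the cases apart. It is not the code in bloom.c: the enum, the
function name and the host-order 'end_offsets' array (standing in for the
cumulative BIDX entries) are all made up for illustration, and the
"all bits set" case is the too-large encoding discussed elsewhere in
this thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical states, named only for this illustration. */
enum filter_state {
	FILTER_NOT_COMPUTED, /* zero-length: equal adjacent BIDX entries */
	FILTER_NO_CHANGES,   /* exactly one byte, all bits zero */
	FILTER_TOO_LARGE,    /* non-empty, every bit set */
	FILTER_REGULAR
};

/*
 * 'end_offsets[i]' is assumed to hold the cumulative number of filter
 * bytes up to and including commit i, already converted to host order.
 * 'data' is assumed to point at the concatenated filter bytes.
 */
static enum filter_state classify_filter(const uint32_t *end_offsets,
					 const unsigned char *data,
					 size_t pos)
{
	uint32_t start = pos ? end_offsets[pos - 1] : 0;
	uint32_t end = end_offsets[pos];
	uint32_t len = end - start;
	uint32_t i;
	int all_zero = 1, all_ones = 1;

	if (!len)
		return FILTER_NOT_COMPUTED;

	for (i = start; i < end; i++) {
		if (data[i] != 0x00)
			all_zero = 0;
		if (data[i] != 0xff)
			all_ones = 0;
	}

	if (len == 1 && all_zero)
		return FILTER_NO_CHANGES;
	if (all_ones)
		return FILTER_TOO_LARGE;
	return FILTER_REGULAR;
}
```

The cost is one extra byte per commit with no modified paths; in
exchange the zero-length entry is no longer overloaded.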

>   - write "too large" Bloom filters (i.e., commits with >= 512 changed
>     paths in a diff to their first parent) as a non-empty Bloom filter
>     with all bits set high.
> 
> I think we're still no better off today than before, because of the
> overloading in the length-zero Bloom filter. Because we would treat
> empty filters the same as ones that haven't been computed, we would
> recompute empty filters, and that would count against our
> '--max-new-filters' budget.
> 
> I don't see a non-convoluted way to split the overloaded length-zero
> case into something that is distinguishable without a format extension.

See above, no format extension needed.
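
A rough sketch of why the budget concern goes away under that encoding,
assuming a hypothetical helper (not an existing function in Git) that
walks per-commit filter lengths: only zero-length entries mean "not yet
computed", so a commit with no modified paths (length one) is never
recomputed and never counts against '--max-new-filters'.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Returns how many filters one run would compute. 'filter_len[i]' is the
 * assumed byte length of commit i's stored filter; 'max_new_filters'
 * models the '--max-new-filters' budget.
 */
static size_t count_filters_to_compute(const uint32_t *filter_len,
				       size_t nr, size_t max_new_filters)
{
	size_t i, computed = 0;

	for (i = 0; i < nr && computed < max_new_filters; i++) {
		/* Length zero is the only "missing" marker. */
		if (filter_len[i] == 0)
			computed++;
	}
	return computed;
}
```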

> By the way, I think that your idea is good, and that it would be
> workable without the existing structure of the BIDX chunk (which itself
> made sense at the time that it was written).
> 
> So, I really want your idea to work. But, I think that ultimately the
> BFXL chunk is a more straightforward path forward.
> 
> 
> Thanks,
> Taylor