Re: [PATCH 10/12] bloom: encode out-of-bounds filters as non-empty

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 11 Sep 2020 14:15:18 -0400

On 9/10/2020 11:45 AM, Taylor Blau wrote:
> On Wed, Sep 09, 2020 at 11:35:57PM -0400, Taylor Blau wrote:
>> On Wed, Sep 09, 2020 at 11:23:48AM -0400, Taylor Blau wrote:
>>> -		filter->data = NULL;
>>> -		filter->len = 0;
>>> +		filter->data = xmalloc(1);
>>> +		filter->data[0] = 0xFF;
>>> +		filter->len = 1;
>>>
>>>  		if (computed)
>>>  			*computed |= BLOOM_TRUNC_LARGE;
>>
>> Oops, I missed the case that added by the previous patch where the
>> number of diff entries is smaller than the limit, but the hashmap
>> entries (after directories are added and such) crosses the threshold.
>>
>> Specifically, this patch doesn't write the 0xFF filter like it should.
>> I'll send a different version of this patch tomorrow.
> 
> This one should do the trick. Let's use it instead.
> 
> --- >8 ---
> 
> Subject: [PATCH] bloom: encode out-of-bounds filters as non-empty
> 
> When a changed-path Bloom filter has either zero, or more than a
> certain number (commonly 512) of entries, the commit-graph machinery
> encodes it as "missing". More specifically, it sets the indices adjacent
> in the BIDX chunk as equal to each other to indicate a "length 0"
> filter; that is, that the filter occupies zero bytes on disk.
> 
> This has heretofore been fine, since the commit-graph machinery has no
> need to care about these filters with too few or too many changed paths.
> Both cases act like no filter has been generated at all, and so there is
> no need to store them.
> 
> In a subsequent commit, however, the commit-graph machinery will learn
> to only compute Bloom filters for some commits in the current
> commit-graph layer. This is a change from the current implementation
> which computes Bloom filters for all commits that are in the layer being
> written. Critically for this patch, only computing some of the Bloom
> filters means adding a third state for length 0 Bloom filters: zero
> entries, too many entries, or "hasn't been computed".
> 
> It will be important for that future patch to distinguish between "not
> representable" (i.e., zero or too-many changed paths), and "hasn't been
> computed". In particular, we don't want to waste time recomputing
> filters that have already been computed.
> 
> To that end, change how we store Bloom filters in the "computed but not
> representable" category:
> 
>   - Bloom filters with no entries are stored as a single byte with all
>     bits low (i.e., all queries to that Bloom filter will return
>     "definitely not")
> 
>   - Bloom filters with too many entries are stored as a single byte with
>     all bits set high (i.e., all queries to that Bloom filter will
>     return "maybe").
> 
> These rules are sufficient to not incur a behavior change by changing
> the on-disk representation of these two classes. Likewise, no
> specification changes are necessary for the commit-graph format, either:
> 
>   - Filters that were previously empty will be recomputed and stored
>     according to the new rules, and
> 
>   - old clients reading filters generated by new clients will interpret
>     the filters correctly and be none the wiser to how they were
>     generated.
> 
> Clients will invoke the Bloom machinery in more cases than before, but
> this can be addressed by returning a NULL filter when all bits are set
> high. This can be addressed in a future patch.
> 
> Finally, note that this does increase the size of on-disk commit-graphs,
> but far less than other proposals. In particular, this is generally more
> efficient than storing a bitmap for which commits haven't computed their
> Bloom filters. Storing a bitmap incurs a penalty of one bit per commit,
> whereas storing explicit filters as above incurs a penalty of one byte
> per too-large or too-small commit.
> 
> In practice, these boundary commits likely occupy a small proportion of
> the overall number of commits, and so the size penalty is likely smaller
> than storing a bitmap for all commits.
> 
> A test to exercise filters which contain too many changed path entries
> will be introduced in a subsequent patch.
> 
> Suggested-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> Suggested-by: Jakub Narębski <jnareb@xxxxxxxxx>
> Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
> ---
>  .../technical/commit-graph-format.txt         |  2 +-
>  bloom.c                                       | 16 ++++++++++++--
>  bloom.h                                       |  1 +
>  commit-graph.c                                |  5 +++++
>  t/t0095-bloom.sh                              |  8 +++----
>  t/t4216-log-bloom.sh                          | 21 +++++++++++++++++--
>  6 files changed, 44 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index 6ddbceba15..6585f1948a 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -125,7 +125,7 @@ CHUNK DATA:
>      * The rest of the chunk is the concatenation of all the computed Bloom
>        filters for the commits in lexicographic order.
>      * Note: Commits with no changes or more than 512 changes have Bloom filters
> -      of length zero.
> +      of length one, with either all bits set to zero or one respectively.
>      * The BDAT chunk is present if and only if BIDX is present.
> 
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
> diff --git a/bloom.c b/bloom.c
> index db9fb82437..d24747a1d5 100644
> --- a/bloom.c
> +++ b/bloom.c
> @@ -177,6 +177,13 @@ static int pathmap_cmp(const void *hashmap_cmp_fn_data,
>  	return strcmp(e1->path, e2->path);
>  }
> 
> +static void init_truncated_large_filter(struct bloom_filter *filter)
> +{
> +	filter->data = xmalloc(1);
> +	filter->data[0] = 0xFF;
> +	filter->len = 1;
> +}
> +

So this hunk is essentially the new bit in this version.

> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 232ba2c485..7e4ab1795f 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -71,8 +71,8 @@ test_expect_success 'get bloom filters for commit with no changes' '
>  	git init &&
>  	git commit --allow-empty -m "c0" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:00|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual
> @@ -107,8 +107,8 @@ test_expect_success EXPENSIVE 'get bloom filter for commit with 513 changes' '
>  	git add bigDir &&
>  	git commit -m "commit with 513 changes" &&
>  	cat >expect <<-\EOF &&
> -	Filter_Length:0
> -	Filter_Data:
> +	Filter_Length:1
> +	Filter_Data:ff|
>  	EOF
>  	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>  	test_cmp expect actual

And these two tests now show the new behavior. Nice.

This version matches my expectations. Thanks.

-Stolee