Re: test-tool: bloom: generate_filter for multiple string?

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 31 Dec 2020 06:31:52 -0500

On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote:
> 
> Hello,
> 
> I'm reading the code for Bloom Filter to see if arXiv:2012.00472
> could be an improvement.
> 
> I'm not sure if I'm missing something or it's our test-bloom generate_filter
> doesn't really support testing for multiple inputs.

It doesn't support creating a _realistic_ filter of the given size. The tests
at this level are more about keeping the filters consistent so we can trust
that we are not accidentally changing the hashing algorithm and its file format.

The test works if we provide multiple inputs, the problem is that the resulting
filter has a higher density of bits than we expect, since we didn't size it
according to the number of inputs.

> If I understand correctly, we should either:
> * allocate more entry for inputs; or
> * abort if argc != 3

Either approach would be fine. I wonder what your goal is for testing the
multiple inputs. Are you expecting a larger filter to demonstrate some
behavior that is not tested by a small filter?

> diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c
> index 46e97b04eb..1026facc59 100644
> --- a/t/helper/test-bloom.c
> +++ b/t/helper/test-bloom.c
> @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv)
>  	if (!strcmp(argv[1], "generate_filter")) {
>  		struct bloom_filter filter;
>  		int i = 2;
> -		filter.len =  (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD;
> -		filter.data = xcalloc(filter.len, sizeof(unsigned char));
>  
>  		if (argc - 1 < i)
>  			usage(bloom_usage);
>  
> +		filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1)
> +			/ BITS_PER_WORD;
> +		filter.data = xcalloc(filter.len, sizeof(unsigned char));
> +

Whitespace issues aside, this is the right approach to make the filter grow with
the input.

>  		while (argv[i]) {
>  			add_string_to_filter(argv[i], &filter);
>  			i++;
> diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh
> index 7e4ab1795f..6d83927c86 100755
> --- a/t/t0095-bloom.sh
> +++ b/t/t0095-bloom.sh
> @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'compute bloom key for multiple string' '
> +	cat >expect <<-\EOF &&
> +	Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d|
> +	Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585|

Multiple hash outputs work already, helping us build confidence that our
hash algorithm hasn't changed.

> +	Filter_Length:3
> +	Filter_Data:45|ba|ac|

And this is the part of the output that you are changing.

> +	EOF
> +	test-tool bloom generate_filter "Hello world!" file.txt >actual &&
> +	test_cmp expect actual
> +'
> +

So, your suggested change has merit. I just wonder what value it provides on top
of the current implementation? If your goal is to create a way to inspect a full-sized
filter, then that would be interesting to explore.

Thanks,
-Stolee