On 12/30/2020 10:54 PM, Đoàn Trần Công Danh wrote: > > Hello, > > I'm reading the code for Bloom Filter to see if arXiv:2012.00472 > could be an improvement. > > I'm not sure if I'm missing something or it's our test-bloom generate_filter > doesn't really support testing for multiple inputs. It doesn't support creating a _realistic_ filter of the given size. The tests at this level are more about keeping the filters consistent so we can trust that we are not accidentally changing the hashing algorithm and its file format. The test works if we provide multiple inputs, the problem is that the resulting filter has a higher density of bits than we expect, since we didn't size it according to the number of inputs. > If I understand correctly, we should either: > * allocate more entry for inputs; or > * abort if argc != 3 Either approach would be fine. I wonder what your goal is for testing the multiple inputs. Are you expecting a larger filter to demonstrate some behavior that is not tested by a small filter? > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c > index 46e97b04eb..1026facc59 100644 > --- a/t/helper/test-bloom.c > +++ b/t/helper/test-bloom.c > @@ -68,12 +68,14 @@ int cmd__bloom(int argc, const char **argv) > if (!strcmp(argv[1], "generate_filter")) { > struct bloom_filter filter; > int i = 2; > - filter.len = (settings.bits_per_entry + BITS_PER_WORD - 1) / BITS_PER_WORD; > - filter.data = xcalloc(filter.len, sizeof(unsigned char)); > > if (argc - 1 < i) > usage(bloom_usage); > > + filter.len = (settings.bits_per_entry * (argc - 2) + BITS_PER_WORD - 1) > + / BITS_PER_WORD; > + filter.data = xcalloc(filter.len, sizeof(unsigned char)); > + Whitespace issues aside, this is the right approach to make the filter grow with the input. > while (argv[i]) { > add_string_to_filter(argv[i], &filter); > i++; > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh > index 7e4ab1795f..6d83927c86 100755 > --- a/t/t0095-bloom.sh > +++ b/t/t0095-bloom.sh > @@ -67,6 +67,17 @@ test_expect_success 'compute bloom key for test string 2' ' > test_cmp expect actual > ' > > +test_expect_success 'compute bloom key for multiple string' ' > + cat >expect <<-\EOF && > + Hashes:0xb270de9b|0x1bb6f26e|0x84fd0641|0xee431a14|0x57892de7|0xc0cf41ba|0x2a15558d| > + Hashes:0x20ab385b|0xf5237fe2|0xc99bc769|0x9e140ef0|0x728c5677|0x47049dfe|0x1b7ce585| Multiple hash outputs work already, helping us build confidence that our hash algorithm hasn't changed. > + Filter_Length:3 > + Filter_Data:45|ba|ac| And this is the part of the output that you are changing. > + EOF > + test-tool bloom generate_filter "Hello world!" file.txt >actual && > + test_cmp expect actual > +' > + So, your suggested change has merit. I just wonder what value it provides on top of the current implementation? If your goal is to create a way to inspect a full-sized filter, then that would be interesting to explore. Thanks, -Stolee