Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths

Garima Singh <garimasigit@xxxxxxxxx> writes:
> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>
>>>
>>> Add the core Bloom filter logic for computing the paths changed between a
>>> commit and its first parent. For details on what Bloom filters are and how they
>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>> explanation of the adoption of Bloom filters as described in [2] and [3].
>>                                                                           ^^- to add
>
> Not sure what this means. Can you please clarify. 
>
>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>>    entry respectively. They served as great starting values, the mathematical
>>>    details behind this choice are described in [1] and [4]. The implementation,
>>                                                                                ^^- to add
>
> Not sure what this means. Can you please clarify.

I'm sorry for not being clear.  What I wanted to say is that in both
cases the quoted line should have ended as shown, that is with a full
stop in the first case and with a comma in the second:

  "as described in [2] and [3]."

  "The implementation,"

What I wrote (trying to put the arrow below the final full stop or
comma) only works when viewed with a fixed-width font.

>>> 3. The filters are sized according to the number of changes in each commit,
>>>    with minimum size of one 64 bit word.

[...]
>> The interesting corner case, which might be worth specifying explicitly,
>> is what happens in the case where there are _no changes_ with respect to
>> the first parent (which can happen either with a commit created with
>> `git commit --allow-empty`, or a merge created e.g. with `git merge
>> --strategy=ours`).  Is this case represented as a Bloom filter of length
>> 0, or as a Bloom filter of length one 64-bit word (the minimal length)
>> composed of all 0's (0x0000000000000000)?
>> 
>
> See t0095-bloom.sh: The filter for a commit with no changes is of length 0.
> I will call it out specifically in the appropriate commit message as well. 

I have realized only later that both "no changes" and "no data" use a
filter of length 0; this works well because checking the diff when
there were no changes is cheap (both tree oids are the same).

>>> ---
>>>  Makefile              |   2 +
>>>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>>>  bloom.h               |  56 +++++++++++
>>>  t/helper/test-bloom.c |  84 ++++++++++++++++
>>>  t/helper/test-tool.c  |   1 +
>>>  t/helper/test-tool.h  |   1 +
>>>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>>>  7 files changed, 485 insertions(+)
>>>  create mode 100644 bloom.c
>>>  create mode 100644 bloom.h
>>>  create mode 100644 t/helper/test-bloom.c
>>>  create mode 100755 t/t0095-bloom.sh
>> 
>> As I wrote earlier, in my opinion this patch could be split into three
>> individual single-functionality pieces, to make it easier to review and
>> to aid bisectability if needed.
>
> Doing this in v3. 

Thanks.  Though if it would make (much) more work for you, I can work
with the unsplit patch, no problem.

>>> +
>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>> +{
>>> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>> +	count &= mask;
>>> +	return ((value >> count) | (value << ((-count) & mask)));
>>> +}
>> 
>> Hmmm... both the algorithm on Wikipedia and the reference implementation
>> use rotate *left*, not rotate *right*, in the implementation of the
>> Murmur3 hash, see
>> 
>>   https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>   https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>> 
>> 
>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>> {
>>   return (x << r) | (x >> (32 - r));
>> }
>
> Thanks! Fixed this in v3. More on it later. 

Sidenote: If I understand it correctly, the Bloom filter functionality
is included in Scalar [1].  What will then happen with all those Bloom
filter chunks in commit-graph files written with the wrong hash
function?

[1]: https://devblogs.microsoft.com/devops/introducing-scalar/

>>> +
>>> +/*
>>> + * Calculate a hash value for the given data using the given seed.
>>> + * Produces a uniformly distributed hash value.
>>> + * Not considered to be cryptographically secure.
>>> + * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>> + **/
>>     ^^-- why two _trailing_ asterisks?
>
> Oops. Fixed. 

Often two _leading_ asterisks are used to mark a comment as containing
a docstring in some specific format, like Doxygen.  Two _trailing_
asterisks look like a typo.

>>> +static uint32_t seed_murmur3(uint32_t seed, const char *data, int len)
>> 
>> In short, I think that the name of the function should be murmur3_32, or
>> murmurhash3_32, or possibly murmur3_32_seed, or something like that.
>
> Renamed it to murmur3_seeded in v3. The input and output types in the 
> signature make it clear that it is 32-bit version.

All right, I can agree with that.

>>> +{
>>> +	const uint32_t c1 = 0xcc9e2d51;
>>> +	const uint32_t c2 = 0x1b873593;
>>> +	const uint32_t r1 = 15;
>>> +	const uint32_t r2 = 13;
>>> +	const uint32_t m = 5;
>>> +	const uint32_t n = 0xe6546b64;
>>> +	int i;
>>> +	uint32_t k1 = 0;
>>> +	const char *tail;
>>> +
>>> +	int len4 = len / sizeof(uint32_t);
>>> +
>>> +	const uint32_t *blocks = (const uint32_t*)data;
>>> +
>>> +	uint32_t k;
>>> +	for (i = 0; i < len4; i++)
>>> +	{
>>> +		k = blocks[i];
>> 
>> IMPORTANT: There is a comment around there in the example implementation
>> in C on Wikipedia that this operation above is a source of differing
>> results across endianness.  
>
> Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
> process the data in 1 byte words to avoid hitting any endian-ness issues. 
> See this part of the thread that carries the fix and the related discussion. 
>   https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
> I will be squashing those changes in appropriately in v3.  

[...]
>>> +		k1 *= c2;
>>> +		seed ^= k1;
>>> +		break;
>>> +	}
>>> +
>>> +	seed ^= (uint32_t)len;
>>> +	seed ^= (seed >> 16);
>>> +	seed *= 0x85ebca6b;
>>> +	seed ^= (seed >> 13);
>>> +	seed *= 0xc2b2ae35;
>>> +	seed ^= (seed >> 16);
>>> +
>>> +	return seed;
>>> +}
>> 
>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>> This may avoid the Big-endian vs Little-endian confusion,
>> that is wrong results on Big-endian architectures, but
>> it also may slow down the algorithm.
>
> Oh cool! You have seen that patch. And yes, we understand that it might add 
> a little overhead but at this point it is more important to be correct on all
> architectures instead of micro-optimizing and introducing different 
> implementations for Little-endian and Big-endian. This would make this 
> series overly complicated. Optimizing the hashing techniques would deserve a
> series of its own, which we can definitely revisit later.

Right: "first make it work, then make it right, and, finally, make it fast."

Anyway, could you maybe compare the performance of a file history
operation using Bloom filters between the old version (operating on
32-bit / 4-byte words) and the new version (operating on 1-byte words),
to see whether it matters or not?

>> The public domain implementation in PMurHash.c in the SMHasher
>> (re)implementation in Chromium (see URL above) falls back to 1-byte
>> operations only if it doesn't know the endianness (or if it is neither
>> little-endian nor big-endian, i.e. middle-endian or mixed-endian --
>> though I doubt that Git works correctly on mixed-endian anyway).
>> 
>> 
>> Sidenote: it looks like the current implementation of Murmur hash in
>> Chromium uses MurmurHash3_x86_32, i.e. the little-endian unaligned-safe
>> implementation, but prepares data by swapping with StringToLE32
>> https://github.com/chromium/chromium/blob/master/components/variations/variations_murmur_hash.h

The solution in PMurHash.c in Chromium and the pseudo-code algorithm on
Wikipedia do endian handling only for the remaining bytes (while the
solution in [the beginnings of] Appleby's code, and in the current
above-mentioned Chromium implementation, do the conversion for all
bytes).  I think that handling it only for the remaining bytes (for
data sizes not being a multiple of 32 bits / 4 bytes) is enough; all
other operations, that is multiply, rotate, xor and addition, do not
depend on endianness.

>> Assuming that the terminating NUL ("\0") character of a c-string is not
>> included in hash calculations, then murmur3_x86_32 hash has the
>> following results (all results are for seed equal 0):
>> 
>> ''               -> 0x00000000
>> ' '              -> 0x7ef49b98
>> 'Hello world!'   -> 0x627b0c2c
>> 'The quick brown fox jumps over the lazy dog'   -> 0x2e4ff723
>> 
>> C source (from Wikipedia): https://godbolt.org/z/ofa2p8
>> C++ source (Appleby's):    https://godbolt.org/z/BoSt6V
>> 
>> The implementation provided in this patch, with rotate_right (instead of
>> rotate_left) gives, on little-endian machine, different results:
>> 
>> ''               -> 0x00000000
>> ' '              -> 0xd1f27e64
>> 'Hello world!'   -> 0xa0791ad7
>> 'The quick brown fox jumps over the lazy dog'   -> 0x99f1676c
>> 
>> https://github.com/gitgitgadget/git/blob/e1b076a714d611e59d3d71c89221e41a3427fae4/bloom.c#L21
>> C source (via GitGitGadget): https://godbolt.org/z/R9s8Tt
>> 
>
> Thanks! This is an excellent catch! Fixing the rotate_right to rotate_left, 
> gives us the same answers as the two implementations you pointed out. I have
> added the appropriate unit tests in v3 and they match the values you obtained 
> from the other implementations. Thanks a lot for the rigor! 
>
> We based our implementation on the pseudo code and not on the sample code 
> presented here: https://en.wikipedia.org/wiki/MurmurHash#Algorithm
> We just didn't parse the ROL instruction correctly. 

All right, that's good.
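For reference, here is a minimal byte-at-a-time sketch of
MurmurHash3 x86_32 (names are mine, not the patch's code) that
reproduces the test vectors quoted above; because each 32-bit block is
assembled from individual bytes, it gives the same answer on little-
and big-endian machines:

```c
#include <stdint.h>
#include <stddef.h>

static uint32_t rotl32(uint32_t x, int8_t r)
{
	return (x << r) | (x >> (32 - r));
}

static uint32_t murmur3_32(const char *data, size_t len, uint32_t seed)
{
	const uint32_t c1 = 0xcc9e2d51;
	const uint32_t c2 = 0x1b873593;
	uint32_t h = seed;
	uint32_t k1 = 0;
	size_t i;

	/* body: process full 4-byte blocks, assembled little-endian
	 * from individual bytes to stay endian-safe */
	for (i = 0; i + 4 <= len; i += 4) {
		uint32_t k = (uint32_t)(unsigned char)data[i]
			   | ((uint32_t)(unsigned char)data[i + 1] << 8)
			   | ((uint32_t)(unsigned char)data[i + 2] << 16)
			   | ((uint32_t)(unsigned char)data[i + 3] << 24);
		k *= c1;
		k = rotl32(k, 15);
		k *= c2;
		h ^= k;
		h = rotl32(h, 13);
		h = h * 5 + 0xe6546b64;
	}

	/* tail: remaining 1-3 bytes */
	switch (len & 3) {
	case 3: k1 ^= (uint32_t)(unsigned char)data[i + 2] << 16; /* fallthrough */
	case 2: k1 ^= (uint32_t)(unsigned char)data[i + 1] << 8;  /* fallthrough */
	case 1: k1 ^= (uint32_t)(unsigned char)data[i];
		k1 *= c1;
		k1 = rotl32(k1, 15);
		k1 *= c2;
		h ^= k1;
	}

	/* finalization mix */
	h ^= (uint32_t)len;
	h ^= h >> 16;
	h *= 0x85ebca6b;
	h ^= h >> 13;
	h *= 0xc2b2ae35;
	h ^= h >> 16;
	return h;
}
```

With seed 0 this matches the vectors listed earlier in the thread,
e.g. 'Hello world!' -> 0x627b0c2c.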

Note that the pseudo code includes the following:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)
        // Note: Endian swapping is only necessary on big-endian machines.
        //       The purpose is to place the meaningful digits towards the low end of the value,
        //       so that these digits have the greatest potential to affect the low range digits
        //       in the subsequent multiplication.  Consider that locating the meaningful digits
        //       in the high range would produce a greater effect upon the high digits of the
        //       multiplication, and notably, that such high digits are likely to be discarded
        //       by the modulo arithmetic under overflow.  We don't want that.

[...]
>>> +{
>>> +	int i;
>>> +	const uint32_t seed0 = 0x293ae76f;
>>> +	const uint32_t seed1 = 0x7e646e2c;
>> 
>> Where did those seed values come from?
>>
>
> Those values were chosen randomly. They will be fixed constants for the 
> current hashing version. I will add a note calling this out in the 
> appropriate commit messages and the Documentation in v3. 

Nice to know.

I wonder if those seed values should be relatively prime, and whether
seed1 should be odd (from a theoretical point of view).

>>> +	const uint32_t hash0 = seed_murmur3(seed0, data, len);
>>> +	const uint32_t hash1 = seed_murmur3(seed1, data, len);
>>> +
>>> +	key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t));
>>> +	for (i = 0; i < settings->num_hashes; i++)
>>> +		key->hashes[i] = hash0 + i * hash1;
>> 
>> Note that in [3] the authors say that the double hashing technique has
>> some problems.  For one, we should ensure that hash1 is not zero, and
>> even better that it is odd (which makes it relatively prime to the
>> filter size, which is a multiple of 64).  It also suffers from something
>> called "approximate fingerprint collisions".
>> 
>> That is why they define the "enhanced double hashing" technique, which
>> does not suffer from those problems (Algorithm 2, page 11/15).
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++) {
>>   +		key->hashes[i] = hash0;
>>   +
>>   +		hash0 = hash0 + hash1;
>>   +		hash1 = hash1 + i + 1;
>>   +	}
>> 
>> This can also be written in closed form, based on equation (6)
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++)
>>   +		key->hashes[i] = hash0 + i * hash1 + i*(i*i - 1)/6;
>> 
>> 
>> In later paper [6] the closed form for "enhanced double hashing"
>> (p. 188) is slightly modified (or rather they use different variant of
>> this technique):
>> 
>>   +	for (i = 0; i < settings->num_hashes; i++)
>>   +		key->hashes[i] = hash0 + i * hash1 + i*i;
>> 
>> This is a variant of more generic "enhanced double hashing", section
>> 5.2 (Enhanced) Double Hashing Schemes (page 199):
>> 
>>         h_1(u) + i h_2(u) + f(i)    mod m
>> 
>> with f(i) = i^2 = i*i.
>> 
>> They have tested enhanced double hashing with both f(i) = i*i and
>> f(i) = i*i*i, as well as the triple hashing technique, and found that
>> these perform slightly better than the straight double hashing
>> technique (Fig. 1, page 212, section 3).
>> 
>
> Thanks for the detailed research here! The hash becoming zero and the 
> approximate fingerprint collision are both extremely rare situations. In both
> cases, we would just see `git log` having to diff more trees than if it didn't 
> occur. While these techniques would be great optimizations to do, especially
> if this implementation gets pulled into more generic hashing applications
> in the code, we think that for the purposes of the current series - it is not 
> worth it. I say this because Azure Repos has been using this exact hashing 
> technique for several years now without any glitches. And we think it would
> be great to rely on this battle tested strategy in at least the first version
> of this feature. 

All right, that is a good strategy.

I wonder if switching from double hashing to enhanced double hashing
(for example the variant with i*i added) would bring any noticeable
performance improvements in Git operations (due to fewer false
positives).
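To make the difference concrete, here is a small sketch (illustrative
names, not the bloom.c API) of plain double hashing next to the
enhanced variant, whose closed form is hash0 + i*hash1 + i*(i*i - 1)/6:

```c
#include <stdint.h>

/* Plain double hashing: g_i = hash0 + i*hash1 (mod 2^32). */
static void double_hashing(uint32_t *out, int k, uint32_t h0, uint32_t h1)
{
	int i;

	for (i = 0; i < k; i++)
		out[i] = h0 + (uint32_t)i * h1;
}

/*
 * Enhanced double hashing: g_i = hash0 + i*hash1 + (i^3 - i)/6
 * (mod 2^32), computed incrementally so the loop needs no
 * multiplication.
 */
static void enhanced_double_hashing(uint32_t *out, int k,
				    uint32_t h0, uint32_t h1)
{
	int i;

	for (i = 0; i < k; i++) {
		out[i] = h0;
		h0 += h1;
		h1 += (uint32_t)(i + 1); /* accumulates f(i) = (i^3 - i)/6 */
	}
}
```

The incremental form and the closed form agree term by term; the extra
f(i) term is what breaks the "approximate fingerprint collisions" of
plain double hashing.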

>>> +
>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>> +				      struct commit *c)
>>> +{
>>> +	struct bloom_filter *filter;
>>> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>> +	int i;
>>> +	struct diff_options diffopt;
>>> +
>>> +	if (!bloom_filters.slab_size)
>>> +		return NULL;
>> 
>> This is testing that commit slab for per-commit Bloom filters is
>> initialized, isn't it?
>> 
>> First, should we write the condition as
>> 
>> 	if (!bloom_filters.slab_size)
>> 
>> or would the following be more readable
>> 
>> 	if (bloom_filters.slab_size == 0)
>> 
>
> Sure. Switched to `if (bloom_filter.slab_size == 0)` in v3. 

Though either works, and now that I have thought about it a bit, the
former reads more like a test of whether the bloom_filters slab is
initialized.  Your choice.

>> Second, should we return NULL, or should we just initialize the slab?
>> Or is non-existence of slab treated as a signal that the Bloom filters
>> mechanism is turned off?
>> 
>
> Yes. We purposefully choose to return NULL and ignore the mechanism 
> overall because we use Bloom filters best effort only. 

All right.

>>> +
>>> +	if (diff_queued_diff.nr <= 512) {
>>
>> Second, there is a minor issue that diff_queue_struct.nr stores the
>> number of filepairs, that is the number of changed files, while the
>> number of elements added to Bloom filter is number of changed blobs and
>> trees.  For example if the following files are changed:
>> 
>>   sub/dir/file1
>>   sub/file2
>> 
>> then diff_queued_diff.nr is 2, but number of elements to be added to
>> Bloom filter is 4.
>> 
>>   sub/dir/file1
>>   sub/file2
>>   sub/dir/
>>   sub/
>> 
>> I'm not sure if it matters in practice.
>> 
>
> It does not matter much in practice, since the directories usually tend
> to collapse across the changes. Still, I will add another limit after 
> creating the hashmap entries to cap at 640 so that we have a maximum of 
> 100 changes in the bloom filter. 
>
> We plan to make these values configurable later. 

I'm not sure if it is truly necessary; we can treat the limit on the
number of changed paths as a "best effort" limit on the Bloom filter
size.

I just wanted to point out the difference.


Side note: I wonder if it would be worth it (in the future) to change
the handling of commits with a large number of changes.  I was thinking
about switching to a soft and a hard limit: the soft limit would be on
the size of the Bloom filter, that is, if the number of elements times
bits per element is greater than the size threshold, we don't increase
the size of the filter.

This would mean that the false positive ratio (the fraction of files
that are not present but get the answer "maybe" instead of "no" out of
the filter) would increase, so there would be a need for another, hard
limit where we decide that it is not worth it, and do not store the
data for the Bloom filter -- the current "no data" case with an empty
filter of length 0.  This hard limit could be imposed on the number of
changed files, or on the number of paths added to the filter, or on the
number of bits set to 1 in the filter (on popcount), or some
combination thereof.

[...]
>>> +
>>> +		for (i = 0; i < diff_queued_diff.nr; i++) {
>>> +			const char* path = diff_queued_diff.queue[i]->two->path;
>> 
>> Is it correct that we consider only the post-image name for storing
>> changes in the Bloom filter?  Currently if a file was renamed (or
>> deleted), it is considered changed, and `git log -- <old-name>` lists
>> the commit that changed the file name too.
>
> The tests in t4216-log-bloom.sh ensure that the output of `git log -- <oldname>` 
> remains unchanged for renamed and deleted files, when using bloom filters. 
> I realize that I fat fingered over checking the old name, and didn't have an 
> explicit deleted file in the test. I have added them in v3, and the tests pass. 
> So the behavior is preserved and as expected when using Bloom filters. 
> Thanks for paying close attention! 

It seems like it shouldn't work, as we are not adding the old name to
the Bloom filter, but that only means that I misunderstood how
diff_tree_oid() works with default options.  It turns out that without
explicitly turning on rename detection it shows a rename as a deletion
of the old name and an addition of the new name -- so if tracking
deletions works correctly, then tracking renames should work correctly.

So it is in fact correct, which as you said was confirmed by (improved)
tests.  I think also that if there was a bug in handling renames in this
code it would have been detected when running CI with
GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS.
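This behaviour is easy to see directly; a quick illustrative sketch
(assumes git is in PATH and uses a throwaway repository; file names and
identity are mine):

```shell
set -e
# Without rename detection, a rename is reported as a deletion of the
# old path plus an addition of the new path -- which is why both paths
# end up covered by the commit's Bloom filter.
dir=$(mktemp -d)
cd "$dir"
git init -q
echo data >old.txt
git add old.txt
git -c user.name=t -c user.email=t@example.com commit -qm "add old.txt"
git mv old.txt new.txt
git -c user.name=t -c user.email=t@example.com commit -qm "rename to new.txt"
# --no-renames: old.txt is reported as D, new.txt as A
git diff-tree -r --no-renames --name-status HEAD
```

With `-M` instead of `--no-renames`, the same commit is shown as a
single rename entry.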

[...]
>>> +		filter->data = NULL;
>>> +		filter->len = 0;
>> 
>> This needs to be explicitly stated both in the commit message and in the
>> API documentation (in comments) that bloom_filter.len == 0 means "no
>> data", while "no changes" is represented as bloom_filter with len == 1
>> and *data == (uint64_t)0;
>> 
>> EDIT: actually "no changes" is also represented as bloom_filter with len
>> equal 0, as it turns out.
>> 
>> One possible alternative could be representing "no data" value with
>> Bloom filter of length 1 and all 64 bits set to 1, and "no changes"
>> represented as filter of length 0.  This is not unambiguous choice!
>>
>
> There is no gain in distinguishing between the absence of a filter and
> a commit having no changes. The effect on `git log -- path` is the same in 
> both cases. We fall back to the normal diffing algorithm in revision.c.
> I will make this clearer in the appropriate commit messages and in the 
> Documentation in v3. 

You are right, which I have realized only when reviewing subsequent
patches in the series.

In the absence of a filter, the "no data" case, we need to fall back to
examining the diff anyway.

In the case of a commit having no changes, the "no changes" case,
computing the diff is cheap because Git can realize that both trees
have the same oid.  So we do not lose performance this way, and we
avoid special-casing it (avoiding branching) when computing the Bloom
filter, as we would if the "no changes" case were represented by a
filter of length 1 with all-zero data.  Comparing tree oids should be
of similar performance to matching the first hash function in bloom_key
against an all-zeros Bloom filter.
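To make that concrete, the lookup logic can be sketched like this
(hypothetical types and names, not the actual bloom.h API):

```c
#include <stdint.h>
#include <stddef.h>

enum bloom_answer {
	BLOOM_NO,       /* path definitely not changed in this commit */
	BLOOM_MAYBE,    /* possibly changed; verify with a real diff */
	BLOOM_FALLBACK  /* no filter data; caller must run the tree diff */
};

struct filter {
	uint64_t *data; /* bit array, in 64-bit words */
	size_t len;     /* word count; 0 covers "no data" and "no changes" */
};

static void filter_add(struct filter *f, const uint32_t *hashes, int k)
{
	int i;

	for (i = 0; i < k; i++) {
		uint32_t bit = hashes[i] % (uint32_t)(f->len * 64);
		f->data[bit / 64] |= (uint64_t)1 << (bit % 64);
	}
}

static enum bloom_answer filter_contains(const struct filter *f,
					 const uint32_t *hashes, int k)
{
	int i;

	if (!f->len)
		return BLOOM_FALLBACK;
	for (i = 0; i < k; i++) {
		uint32_t bit = hashes[i] % (uint32_t)(f->len * 64);
		if (!(f->data[bit / 64] & ((uint64_t)1 << (bit % 64))))
			return BLOOM_NO;
	}
	return BLOOM_MAYBE;
}
```

A length-0 filter always answers BLOOM_FALLBACK, which for the "no
changes" case costs only a cheap same-oid tree comparison in the diff
machinery.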

[...]
>>> +. ./test-lib.sh
>>> +
>>> +test_expect_success 'get bloom filters for commit with no changes' '
>>> +	git init &&
>>> +	git commit --allow-empty -m "c0" &&
>>> +	cat >expect <<-\EOF &&
>>> +	Filter_Length:0
>>> +	Filter_Data:
>>> +	EOF
>>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> +	test_cmp expect actual
>>> +'
>> 
>> A few things.  First, I wonder why we need to provide an object ID;
>> couldn't 'test-tool bloom get_filter_for_commit' parse a commit-ish
>> argument, or would that make it too complicated for no reason?
>
> Yes it was overkill for what I need in the test. 

All right, I agree with that.

>>> +
>>> +test_expect_success 'get bloom filter for commit with 10 changes' '
>>> +	rm actual &&
>>> +	rm expect &&
>>> +	mkdir smallDir &&
>>> +	for i in $(test_seq 0 9)
>>> +	do
>>> +		echo $i >smallDir/$i
>>> +	done &&
>>> +	git add smallDir &&
>>> +	git commit -m "commit with 10 changes" &&
>>> +	cat >expect <<-\EOF &&
>>> +	Filter_Length:4
>>> +	Filter_Data:508928809087080a|8a7648210804001|4089824400951000|841ab310098051a8|
>>> +	EOF
>>> +	test-tool bloom get_filter_for_commit "$(git rev-parse HEAD)" >actual &&
>>> +	test_cmp expect actual
>>> +'
>> 
>> This test is in my opinion fragile, as it unnecessarily tests the
>> implementation details instead of the functionality provided.  If we
>> change the hashing scheme (for example going from double hashing to some
>> variant of enhanced double hashing), or change the base hash function
>> (for example from Murmur3_32 to xxHash_64), or change the number of hash
>> functions (perhaps because of changing the number of bits per element,
>> and thus the optimal number of hash functions from 7 to 6), or change
>> from 64-bit word blocks to 32-bit word blocks, the test would have to be
>> changed.
>
> Regarding this and the rest of you comments on t0095-log-bloom.sh:
>
> I am tweaking it as necessary but the entire point of these tests is to
> break for the things you called out. They need to be intricately tied
> to the current hashing strategy and are hence intended to be fragile so 
> as to catch any subtle or accidental changes in the hashing computation. 
> Any change like the ones you have called out would require a hash version
> change and all the compatibility reactions that come with it. 

All right, if we assume that the commit-graph is not something purely
local^*, and we need interoperability, then this test is necessary and
necessarily fragile.

*. This may happen because the repository and the commit-graph file in
   it is on network disk, and accessed by hosts with different
   endianness.  Or in the future (or possibly now, if one is using
   Scalar) the commit-graph file can be sent together with packfile
   during the fetch operation.

On the other hand, testing the functionality of the Murmur hash and of
the Bloom filter would help find possible trouble if we decide in the
future to change the algorithm details (change the hash function,
and/or move from double hashing to enhanced double hashing, and/or
change how commits with a large number of changes are handled, or even
switch to xor filters [1]).

[1]: Graf, Thomas Mueller; Lemire, Daniel (2019), "Xor Filters: Faster
and Smaller Than Bloom and Cuckoo Filters", https://arxiv.org/abs/1912.08258

> I have added more tests around the murmur3_seeded method in v3. Removed
> some of the redundant ones. 

There is another test that might be worth adding (see the comment below
for why), namely one checking that the bloom_key is computed as expected.

> The other more evolved test cases you call out are covered in the e2e
> integration tests in t4216-log-bloom.sh

All right, but there is another issue to consider.  Good tests should
not only catch the breakage, but also help locate the bug.  That is one
of the advantages that unit tests (like the ones I have proposed) have
over end-to-end functional tests.  They are also often faster.

On the other hand, e2e tests can catch problems with integration, and
actually check that the user-visible behaviour is as expected.


Best,
--
Jakub Narębski

>> 
>> Reviewed-by: Jakub Narębski <jnareb@xxxxxxxxx>
>> 
>> Thanks for working on this.
>> 
>> Best, 
>
> Thank you once again for an excellent and in-depth review of this patch! 
> You have helped make this code so much better!



