Re: [PATCH v2 02/11] bloom: core Bloom filter implementation for changed paths

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/23/2020 8:38 AM, Jakub Narebski wrote:
> Garima Singh <garimasigit@xxxxxxxxx> writes:
>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>
>>>>
>>>> Add the core Bloom filter logic for computing the paths changed between a
>>>> commit and its first parent. For details on what Bloom filters are and how they
>>>> work, please refer to Dr. Derrick Stolee's blog post [1]. It provides a concise
>>>> explaination of the adoption of Bloom filters as described in [2] and [3].
>>>                                                                           ^^- to add
>>
>> Not sure what this means. Can you please clarify. 
>>
>>>> 1. We currently use 7 and 10 for the number of hashes and the size of each
>>>>    entry respectively. They served as great starting values, the mathematical
>>>>    details behind this choice are described in [1] and [4]. The implementation,
>>>                                                                                ^^- to add
>>
>> Not sure what this means. Can you please clarify.
> 
> I'm sorry for not being clear.  What I wanted to say that in both cases
> the last line should have ended in either full stop in first case, or
> comma in second case:
> 
>   "as described in [2] and [3]."
> 
>   "The implementation,"
> 
> What I wrote (trying to put the arrow below final fullstop or comma)
> only works when one is using with fixed-width font.
> 

Aah. Cool. Thanks! 

>>>> ---
>>>>  Makefile              |   2 +
>>>>  bloom.c               | 228 ++++++++++++++++++++++++++++++++++++++++++
>>>>  bloom.h               |  56 +++++++++++
>>>>  t/helper/test-bloom.c |  84 ++++++++++++++++
>>>>  t/helper/test-tool.c  |   1 +
>>>>  t/helper/test-tool.h  |   1 +
>>>>  t/t0095-bloom.sh      | 113 +++++++++++++++++++++
>>>>  7 files changed, 485 insertions(+)
>>>>  create mode 100644 bloom.c
>>>>  create mode 100644 bloom.h
>>>>  create mode 100644 t/helper/test-bloom.c
>>>>  create mode 100755 t/t0095-bloom.sh
>>>
>>> As I wrote earlier, In my opinion this patch could be split into three
>>> individual single-functionality pieces, to make it easier to review and
>>> aid in bisectability if needed.
>>
>> Doing this in v3. 
> 
> Thanks.  Though if it makes (much) more work for you, I can work with
> unsplit patch, no problem.
> 

Thanks! That's great! Splitting the patches will add some overhead. I will 
try and do it provided it does not delay getting v3 on the list. 

>>>> +
>>>> +static uint32_t rotate_right(uint32_t value, int32_t count)
>>>> +{
>>>> +	uint32_t mask = 8 * sizeof(uint32_t) - 1;
>>>> +	count &= mask;
>>>> +	return ((value >> count) | (value << ((-count) & mask)));
>>>> +}
>>>
>>> Hmmm... both the algoritm on Wikipedia, and reference implementation use
>>> rotate *left*, not rotate *right* in the implementation of Murmur3 hash,
>>> see
>>>
>>>   https://en.wikipedia.org/wiki/MurmurHash#Algorithm
>>>   https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp#L23
>>>
>>>
>>> inline uint32_t rotl32 ( uint32_t x, int8_t r )
>>> {
>>>   return (x << r) | (x >> (32 - r));
>>> }
>>
>> Thanks! Fixed this in v3. More on it later. 
> 
> Sidenote: If I understand it correctly Bloom filters functionality is
> included in Scalar [1].  What will happen then with all those Bloom
> filter chunks in commit-graph files with wrong hash functions?
> 
> [1]: https://devblogs.microsoft.com/devops/introducing-scalar/
> 

It is not included in Scalar. Scalar will write to the commit-graph in 
the background using the features available in the git version it is working
with. It will update to include changed path Bloom filters when they are 
available in git. We are not taking the Bloom filter into microsoft/git 
until the format is approved and accepted by the core git community.

>>>> +{
>>>> +	const uint32_t c1 = 0xcc9e2d51;
>>>> +	const uint32_t c2 = 0x1b873593;
>>>> +	const uint32_t r1 = 15;
>>>> +	const uint32_t r2 = 13;
>>>> +	const uint32_t m = 5;
>>>> +	const uint32_t n = 0xe6546b64;
>>>> +	int i;
>>>> +	uint32_t k1 = 0;
>>>> +	const char *tail;
>>>> +
>>>> +	int len4 = len / sizeof(uint32_t);
>>>> +
>>>> +	const uint32_t *blocks = (const uint32_t*)data;
>>>> +
>>>> +	uint32_t k;
>>>> +	for (i = 0; i < len4; i++)
>>>> +	{
>>>> +		k = blocks[i];
>>>
>>> IMPORTANT: There is a comment around there in the example implementation
>>> in C on Wikipedia that this operation above is a source of differing
>>> results across endianness.  
>>
>> Thanks! SZEDER found this on his CI pipeline and we have fixed it to 
>> process the data in 1 byte words to avoid hitting any endian-ness issues. 
>> See this part of the thread that carries the fix and the related discussion. 
>>   https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>> I will be squashing those changes in appropriately in v3.  
> 
> [...]
>>>> +		k1 *= c2;
>>>> +		seed ^= k1;
>>>> +		break;
>>>> +	}
>>>> +
>>>> +	seed ^= (uint32_t)len;
>>>> +	seed ^= (seed >> 16);
>>>> +	seed *= 0x85ebca6b;
>>>> +	seed ^= (seed >> 13);
>>>> +	seed *= 0xc2b2ae35;
>>>> +	seed ^= (seed >> 16);
>>>> +
>>>> +	return seed;
>>>> +}
>>>
>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>> This may avoid the Big-endian vs Little-endian confusion,
>>> that is wrong results on Big-endian architectures, but
>>> it also may slow down the algorithm.
>>
>> Oh cool! You have seen that patch. And yes, we understand that it might add 
>> a little overhead but at this point it is more important to be correct on all
>> architectures instead of micro-optimizing and introducing different 
>> implementations for Little-endian and Big-endian. This would make this 
>> series overly complicated. Optimizing the hashing techniques would deserve a
>> series of its own, which we can definitely revisit later.
> 
> Right, "first make it work, then make it right, and, finally, make it fast.".
> 
> Anyway, could you maybe compare performance of Git for old version
> (operating on 32-bit/4-bytes words) and new version (operating on 1-byte
> words) file history operation with Bloom filters, to see if it matters
> or not?
> 

We chose to switch to 1 byte words for correctness, not performance. 
Also, this specific implementation choice is a very small portion of the 
end to end time spent computing and writing Bloom filters. We run two murmur3 
hashes per path, which is one path per `git log` query; and one path per change 
after parsing trees to compute a diff. Measuring performance and micro-optimizing 
is not worth the effort and/or trading in the simplicity here.


>>>> +
>>>> +struct bloom_filter *get_bloom_filter(struct repository *r,
>>>> +				      struct commit *c)
>>>> +{
>>>> +	struct bloom_filter *filter;
>>>> +	struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS;
>>>> +	int i;
>>>> +	struct diff_options diffopt;
>>>> +
>>>> +	if (!bloom_filters.slab_size)
>>>> +		return NULL;
>>>
>>> This is testing that commit slab for per-commit Bloom filters is
>>> initialized, isn't it?
>>>
>>> First, should we write the condition as
>>>
>>> 	if (!bloom_filters.slab_size)
>>>
>>> or would the following be more readable
>>>
>>> 	if (bloom_filters.slab_size == 0)
>>>
>>
>> Sure. Switched to `if (bloom_filter.slab_size == 0)` in v3. 
> 
> Though either works, and the former looks more like the test if
> bloom_filters slab are initialized, now that I thought about it a bit.
> Your choice.
> 

:) 


>>>> +
>>>> +	if (diff_queued_diff.nr <= 512) {
>>>
>>> Second, there is a minor issue that diff_queue_struct.nr stores the
>>> number of filepairs, that is the number of changed files, while the
>>> number of elements added to Bloom filter is number of changed blobs and
>>> trees.  For example if the following files are changed:
>>>
>>>   sub/dir/file1
>>>   sub/file2
>>>
>>> then diff_queued_diff.nr is 2, but number of elements to be added to
>>> Bloom filter is 4.
>>>
>>>   sub/dir/file1
>>>   sub/file2
>>>   sub/dir/
>>>   sub/
>>>
>>> I'm not sure if it matters in practice.
>>>
>>
>> It does not matter much in practice, since the directories usually tend
>> to collapse across the changes. Still, I will add another limit after 
>> creating the hashmap entries to cap at 640 so that we have a maximum of 
>> 100 changes in the bloom filter. 
>>
>> We plan to make these values configurable later. 
> 
> I'm not sure if it is truly necessary; we can treat limit on number of
> changed paths as "best effort" limit on Bloom filter size.
> 
> I just wanted to point out the difference.
> 

Sure. Not doing this for v3. Glad it got discussed here though!  

> 
> Side note: I wonder if it would be worth it (in the future) to change
> handling commits with large amount of changes.  I was thinking about
> switching to soft and hard limit: soft limit would be on the size of the
> Bloom filter, that is if number of elements times bits per element is
> greater that size threshold, we don't increase the size of the filter.
> 
> This would mean that the false positives ratio (the number of files that
> are not present but get answer "maybe" instead of "no" out of the
> filter) would increase, so there would be a need for another hard limit
> where we decide that it is not worth it, and not store the data for the
> Bloom filter -- current "no data" case with empty filter with length 0.
> This hard limit can be imposed on number of changed files, or on number
> of paths added to filter, or on number of bits set to 1 in the filter
> (on popcount), or some combination thereof.
> 
> [...]

Could be considered in the future. Doesn't make the cut for the current
series though. 

Thanks
Garima Singh



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux