Garima Singh <garimasigit@xxxxxxxxx> writes:

> On 2/23/2020 8:38 AM, Jakub Narebski wrote:
>> Garima Singh <garimasigit@xxxxxxxxx> writes:
>>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>>> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>

[...]

>>>> IMPORTANT: There is a comment around there in the example
>>>> implementation in C on Wikipedia that this operation above is a
>>>> source of differing results across endianness.
>>>
>>> Thanks! SZEDER found this on his CI pipeline and we have fixed it
>>> to process the data in 1-byte words to avoid hitting any endianness
>>> issues. See this part of the thread that carries the fix and the
>>> related discussion:
>>> https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>>> I will be squashing those changes in appropriately in v3.
>>
>> [...]
>>>>
>>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>>> This may avoid the Big-endian vs Little-endian confusion, that is,
>>>> wrong results on Big-endian architectures, but it may also slow
>>>> down the algorithm.
>>>
>>> Oh cool! You have seen that patch. And yes, we understand that it
>>> might add a little overhead, but at this point it is more important
>>> to be correct on all architectures than to micro-optimize and
>>> introduce different implementations for Little-endian and
>>> Big-endian. That would make this series overly complicated.
>>> Optimizing the hashing techniques would deserve a series of its
>>> own, which we can definitely revisit later.
>>
>> Right: "first make it work, then make it right, and, finally, make
>> it fast."
>>
>> Anyway, could you maybe compare the performance of Git between the
>> old version (operating on 32-bit / 4-byte words) and the new version
>> (operating on 1-byte words) for a file history operation with Bloom
>> filters, to see whether it matters or not?
>>
>
> We chose to switch to 1-byte words for correctness, not performance.
> Also, this specific implementation choice is a very small portion of
> the end-to-end time spent computing and writing Bloom filters. We run
> two murmur3 hashes per path, which is one path per `git log` query,
> and one path per change after parsing trees to compute a diff.
> Measuring performance and micro-optimizing is not worth the effort
> and/or the loss of simplicity here.

All right.

I still think that adding a to_le32() invocation before the part that
processes the remaining bytes (the 'switch' statement in the v2 code),
just like in the pseudo-code on Wikipedia:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)

would be enough to have correct results regardless of endianness.

As I wrote:

JN> The solution in PMurHash.c in Chromium [1], and the pseudo-code
JN> algorithm on Wikipedia, do endian handling only for the remaining
JN> bytes (while the beginnings of a solution in Appleby's code, and
JN> the solution in the current above-mentioned Chromium
JN> implementation, do the conversion for all bytes). I think that
JN> handling it only for the remaining bytes (for data sizes not being
JN> a multiple of 32 bits / 4 bytes) is enough; all the other
JN> operations, that is multiply, rotate, xor and addition, do not
JN> depend on endianness.

[1]: https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c

If you have access to, or can run code on, some big-endian
architecture, it should be easy enough to check.
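To make the suggestion concrete, here is a minimal sketch of that tail
handling (this is not the code from the series; to_le32() is defined
locally with a compiler builtin for the sketch, and the function names
are placeholders):

    #include <stdint.h>
    #include <string.h>

    /*
     * Byte-swap on big-endian hosts only; a stand-in for the bswap
     * helpers in git's compat/bswap.h, using a GCC/Clang builtin.
     */
    static inline uint32_t to_le32(uint32_t v)
    {
    #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
            return __builtin_bswap32(v);
    #else
            return v;
    #endif
    }

    static inline uint32_t rotl32(uint32_t x, int8_t r)
    {
            return (x << r) | (x >> (32 - r));
    }

    /*
     * Mix the 1-3 remaining ("tail") bytes of the key into the hash
     * state h1, normalizing the chunk to little-endian byte order so
     * that big- and little-endian hosts compute the same hash.
     */
    static uint32_t murmur3_mix_tail(uint32_t h1, const uint8_t *tail,
                                     size_t remaining)
    {
            const uint32_t c1 = 0xcc9e2d51;
            const uint32_t c2 = 0x1b873593;
            uint32_t k1 = 0;

            memcpy(&k1, tail, remaining);  /* remaining is 1, 2 or 3 */
            k1 = to_le32(k1);              /* same value on every host */

            k1 *= c1;
            k1 = rotl32(k1, 15);
            k1 *= c2;
            return h1 ^ k1;
    }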
Anyway, if you decide on the 1-byte-at-a-time implementation, please
put a comment about the 32-bit chunk implementation.

>> Side note: I wonder if it would be worth it (in the future) to
>> change the handling of commits with a large number of changes. I was
>> thinking about switching to a soft and a hard limit: the soft limit
>> would be on the size of the Bloom filter, that is, if the number of
>> elements times bits per element is greater than the size threshold,
>> we don't increase the size of the filter.
>>
>> This would mean that the false-positive ratio (the number of files
>> that are not present but get the answer "maybe" instead of "no" out
>> of the filter) would increase, so there would be a need for another,
>> hard limit, where we decide that it is not worth it and do not store
>> the data for the Bloom filter -- the current "no data" case with an
>> empty filter of length 0. This hard limit could be imposed on the
>> number of changed files, or on the number of paths added to the
>> filter, or on the number of bits set to 1 in the filter (on the
>> popcount), or some combination thereof.
>>
>> [...]
>
> Could be considered in the future. Doesn't make the cut for the
> current series though.

Right.

Best,
-- 
Jakub Narębski
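P.S. For concreteness, a rough sketch of what such soft and hard
limits could look like, should the idea ever be picked up (every name
and threshold below is a made-up placeholder, nothing from the
series):

    #include <stddef.h>

    #define BITS_PER_ENTRY   10          /* bits per path added */
    #define SOFT_LIMIT_BITS  (512 * 8)   /* cap the filter at 512 bytes */
    #define HARD_LIMIT_PATHS 512         /* beyond this, store no data */

    /*
     * Return the filter size in bits for a commit touching nr_paths
     * paths, or 0 for "do not store a filter at all".
     */
    static size_t bloom_filter_bits(size_t nr_paths)
    {
            size_t bits = nr_paths * BITS_PER_ENTRY;

            if (nr_paths > HARD_LIMIT_PATHS)
                    return 0;               /* hard limit: "no data" */
            if (bits > SOFT_LIMIT_BITS)
                    return SOFT_LIMIT_BITS; /* soft limit: cap the size */
            return bits;
    }

Between the two limits the filter stops growing, so the false-positive
ratio rises with the number of paths; past the hard limit the commit
gets the existing "no data" case, an empty filter of length 0.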