Garima Singh <garimasigit@xxxxxxxxx> writes:

> On 2/23/2020 8:38 AM, Jakub Narebski wrote:
>> Garima Singh <garimasigit@xxxxxxxxx> writes:
>>> On 2/16/2020 11:49 AM, Jakub Narebski wrote:
>>>>> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>

[...]

>>>> IMPORTANT: There is a comment around there in the example
>>>> implementation in C on Wikipedia that this operation above is a
>>>> source of differing results across endianness.
>>>
>>> Thanks! SZEDER found this on his CI pipeline and we have fixed it
>>> to process the data in 1-byte words to avoid hitting any endianness
>>> issues. See this part of the thread that carries the fix and the
>>> related discussion:
>>> https://lore.kernel.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>>> I will be squashing those changes in appropriately in v3.
>>
>> [...]
>>>>
>>>> In https://public-inbox.org/git/ba856e20-0a3c-e2d2-6744-b9abfacdc465@xxxxxxxxx/
>>>> you posted "[PATCH] Process bloom filter data as 1 byte words".
>>>> This may avoid the Big-endian vs Little-endian confusion, that is,
>>>> wrong results on Big-endian architectures, but it may also slow
>>>> down the algorithm.
>>>
>>> Oh cool! You have seen that patch. And yes, we understand that it
>>> might add a little overhead, but at this point it is more important
>>> to be correct on all architectures than to micro-optimize and
>>> introduce different implementations for Little-endian and
>>> Big-endian. That would make this series overly complicated.
>>> Optimizing the hashing techniques would deserve a series of its
>>> own, which we can definitely revisit later.
>>
>> Right: "first make it work, then make it right, and, finally, make
>> it fast."
>>
>> Anyway, could you maybe compare the performance of Git between the
>> old version (operating on 32-bit / 4-byte words) and the new version
>> (operating on 1-byte words) for a file history operation with Bloom
>> filters, to see whether it matters or not?
>>
>
> We chose to switch to 1-byte words for correctness, not performance.
> Also, this specific implementation choice is a very small portion of
> the end-to-end time spent computing and writing Bloom filters. We run
> two murmur3 hashes per path, which is one path per `git log` query,
> and one path per change after parsing trees to compute a diff.
> Measuring performance and micro-optimizing is not worth the effort
> and/or the loss of simplicity here.

All right.

I still think that adding a to_le32() invocation before the part that
processes the remaining bytes (the 'switch' statement in the v2 code),
just like in the pseudo-code on Wikipedia:

    with any remainingBytesInKey do
        remainingBytes ← SwapToLittleEndian(remainingBytesInKey)

would be enough to have correct results regardless of endianness.

As I wrote:

JN> The solution in PMurHash.c in Chromium [1], and the pseudo-code
JN> algorithm on Wikipedia, do endian handling only for the remaining
JN> bytes (while the beginnings of a solution in Appleby's code, and
JN> the solution in the current above-mentioned Chromium
JN> implementation, do the conversion for all bytes). I think that
JN> handling it only for the remaining bytes (for data sizes not being
JN> a multiple of 32 bits / 4 bytes) is enough; all the other
JN> operations, that is multiply, rotate, xor and addition, do not
JN> depend on endianness.

[1]: https://chromium.googlesource.com/external/smhasher/+/5b8fd3c31a58b87b80605dca7a64fad6cb3f8a0f/PMurHash.c

If you have access to, or can run code on, some big-endian
architecture, it should be easy enough to check.
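To make the suggestion concrete, here is a minimal sketch of that tail
handling (this is not the code from the series; to_le32() is defined
locally with a compiler builtin for the sketch, and the function names
are placeholders):

    #include <stdint.h>
    #include <string.h>

    /*
     * Byte-swap on big-endian hosts only; a stand-in for the bswap
     * helpers in git's compat/bswap.h, using a GCC/Clang builtin.
     */
    static inline uint32_t to_le32(uint32_t v)
    {
    #if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
            return __builtin_bswap32(v);
    #else
            return v;
    #endif
    }

    static inline uint32_t rotl32(uint32_t x, int8_t r)
    {
            return (x << r) | (x >> (32 - r));
    }

    /*
     * Mix the 1-3 remaining ("tail") bytes of the key into the hash
     * state h1, normalizing the chunk to little-endian byte order so
     * that big- and little-endian hosts compute the same hash.
     */
    static uint32_t murmur3_mix_tail(uint32_t h1, const uint8_t *tail,
                                     size_t remaining)
    {
            const uint32_t c1 = 0xcc9e2d51;
            const uint32_t c2 = 0x1b873593;
            uint32_t k1 = 0;

            memcpy(&k1, tail, remaining);  /* remaining is 1, 2 or 3 */
            k1 = to_le32(k1);              /* same value on every host */

            k1 *= c1;
            k1 = rotl32(k1, 15);
            k1 *= c2;
            return h1 ^ k1;
    }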
Anyway, if you decide on the 1-byte-at-a-time implementation, please
put a comment about the 32-bit chunk implementation.

>> Side note: I wonder if it would be worth it (in the future) to
>> change the handling of commits with a large number of changes. I was
>> thinking about switching to a soft and a hard limit: the soft limit
>> would be on the size of the Bloom filter, that is, if the number of
>> elements times bits per element is greater than the size threshold,
>> we don't increase the size of the filter.
>>
>> This would mean that the false-positive ratio (the number of files
>> that are not present but get the answer "maybe" instead of "no" out
>> of the filter) would increase, so there would be a need for another,
>> hard limit, where we decide that it is not worth it and do not store
>> the data for the Bloom filter -- the current "no data" case with an
>> empty filter of length 0. This hard limit could be imposed on the
>> number of changed files, or on the number of paths added to the
>> filter, or on the number of bits set to 1 in the filter (on the
>> popcount), or some combination thereof.
>>
>> [...]
>
> Could be considered in the future. Doesn't make the cut for the
> current series though.

Right.

Best,
-- 
Jakub Narębski
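P.S. For concreteness, a rough sketch of what such soft and hard
limits could look like, should the idea ever be picked up (every name
and threshold below is a made-up placeholder, nothing from the
series):

    #include <stddef.h>

    #define BITS_PER_ENTRY   10          /* bits per path added */
    #define SOFT_LIMIT_BITS  (512 * 8)   /* cap the filter at 512 bytes */
    #define HARD_LIMIT_PATHS 512         /* beyond this, store no data */

    /*
     * Return the filter size in bits for a commit touching nr_paths
     * paths, or 0 for "do not store a filter at all".
     */
    static size_t bloom_filter_bits(size_t nr_paths)
    {
            size_t bits = nr_paths * BITS_PER_ENTRY;

            if (nr_paths > HARD_LIMIT_PATHS)
                    return 0;               /* hard limit: "no data" */
            if (bits > SOFT_LIMIT_BITS)
                    return SOFT_LIMIT_BITS; /* soft limit: cap the size */
            return bits;
    }

Between the two limits the filter stops growing, so the false-positive
ratio rises with the number of paths; past the hard limit the commit
gets the existing "no data" case, an empty filter of length 0.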