Hi Sage,

On 29/08/2016 15:55, Sage Weil wrote:
> On Mon, 29 Aug 2016, Loic Dachary wrote:
>> Hi,
>>
>> TL;DR: crush_do_rule using SIMD runs twice as fast, the
>> implementation is straightforward and would help with crushmap
>> validation; is there any reason not to do it?
>>
>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2
>> function (bucket_straw2_choose) calls the hashing function
>> (crush_hash32_3) for each item in a bucket and keeps the best match.
>> When a bucket has four items, the hash function can be run using
>> SIMD instructions: each item value is 32 bits, so four of them fit
>> in a __m128i.
>>
>> I tried to inline the hash function when the conditions are
>> right [1] and ran a test to measure the difference:
>>
>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>> rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000
>>
>> With SIMD:
>>
>> real 0m10.433s
>> user 0m10.428s
>> sys 0m0.000s
>>
>> Without SIMD:
>>
>> real 0m19.344s
>> user 0m19.340s
>> sys 0m0.004s
>>
>> Callgrind's estimated cycle counts for crush_do_rule show the same
>> ratio:
>>
>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>> kcachegrind crush.callgrind
>>
>> With SIMD: crush_do_rule is estimated to use 21,205 cycles.
>> Without SIMD: crush_do_rule is estimated to use 53,068 cycles.
>>
>> This proof of concept relies on instructions that are available on
>> all ARM & Intel processors; nothing complicated is going on. It
>> benefits crush maps that have more than four disks per host, more
>> than four hosts per rack, etc. It is probably a small win for an OSD
>> or even a client. For crushmap validation it helps significantly,
>> since the MON cannot run crushtool asynchronously and validation
>> needs to complete within a few seconds (because it blocks the MON).
>>
>> The implementation is straightforward: it only needs
>> sub/xor/lshift/rshift. The only relatively tricky part is runtime /
>> compile-time detection of the SIMD instructions for both Intel and
>> ARM processors. Luckily this has already been taken care of when
>> integrating the jerasure erasure code plugin.
>>
>> Is there any reason why it would not be good to implement this?
>
> This is really cool! I agree that the straw2 O(n) calculation on
> each node is the place to apply this.
>
> To answer your question, the only real risk/problem I see is that we
> need to keep the optimized variant perfectly in sync with the
> non-optimized one, since the result has to be deterministic. The
> maintenance burden is small, I think, since for that reason the code
> behavior doesn't really change, but we do need to pretty
> exhaustively verify that the new implementation matches the old one.
> Perhaps a set of unit tests that compile both variants and feed them
> randomly sized and weighted straw2 buckets, pushing lots of values
> through?

Right: the implementation is likely to be simple but it needs serious
testing. I'll give it a try.
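For the record, here is roughly the shape of the four-lane hash I have
in mind. This is an untested sketch, not the code from [1]: the helper
names are mine, and the mix order, shift constants and seed are
transcribed from my reading of crush_hashmix() and
crush_hash32_rjenkins1_3() in src/crush/hash.c, which remains the
authoritative reference.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

#define CRUSH_HASH_SEED 1315423911u  /* crush_hash_seed in hash.c */

/* one mix line, right-shift flavour:
 * t1 = t1-t2; t1 = t1-t3; t1 ^= t3 >> n */
#define MIX4R(t1, t2, t3, n) do {                          \
        t1 = _mm_sub_epi32(t1, t2);                        \
        t1 = _mm_sub_epi32(t1, t3);                        \
        t1 = _mm_xor_si128(t1, _mm_srli_epi32(t3, n));     \
} while (0)

/* same, left-shift flavour */
#define MIX4L(t1, t2, t3, n) do {                          \
        t1 = _mm_sub_epi32(t1, t2);                        \
        t1 = _mm_sub_epi32(t1, t3);                        \
        t1 = _mm_xor_si128(t1, _mm_slli_epi32(t3, n));     \
} while (0)

/* four-lane crush_hashmix(): the scalar macro's sub/xor/shift
 * sequence applied to four 32-bit lanes at once */
#define crush_hashmix4(a, b, c) do {    \
        MIX4R(a, b, c, 13);             \
        MIX4L(b, c, a, 8);              \
        MIX4R(c, a, b, 13);             \
        MIX4R(a, b, c, 12);             \
        MIX4L(b, c, a, 16);             \
        MIX4R(c, a, b, 5);              \
        MIX4R(a, b, c, 3);              \
        MIX4L(b, c, a, 10);             \
        MIX4R(c, a, b, 15);             \
} while (0)

/* hash four bucket items against the same (x, r), mirroring the
 * bucket_straw2_choose call crush_hash32_3(hash, x, item, r); each
 * out[i] must equal crush_hash32_rjenkins1_3(x, items[i], r) */
static void crush_hash32_3_x4(uint32_t x, const uint32_t items[4],
                              uint32_t r, uint32_t out[4])
{
        __m128i a  = _mm_set1_epi32((int)x);
        __m128i b  = _mm_loadu_si128((const __m128i *)items);
        __m128i c  = _mm_set1_epi32((int)r);
        __m128i vx = _mm_set1_epi32(231232);  /* x in rjenkins1_3 */
        __m128i vy = _mm_set1_epi32(1232);    /* y in rjenkins1_3 */
        /* hash = seed ^ a ^ b ^ c, with b varying per lane */
        __m128i hash = _mm_xor_si128(
                _mm_set1_epi32((int)(CRUSH_HASH_SEED ^ x ^ r)), b);

        crush_hashmix4(a, b, hash);
        crush_hashmix4(c, vx, hash);
        crush_hashmix4(vy, a, hash);
        crush_hashmix4(b, vx, hash);
        crush_hashmix4(vy, c, hash);

        _mm_storeu_si128((__m128i *)out, hash);
}

The nice property is that every scalar operation has a one-to-one
_mm_* counterpart, so the SIMD variant can be reviewed against the
scalar macro line by line, which should help keep the maintenance
burden you mention low.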
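For the instruction detection I expect to reuse the same approach as
the jerasure plugin glue. As a sketch of the x86 side only (the
function names below are placeholders, __builtin_cpu_supports is a
GCC/Clang builtin, and the ARM build would key off NEON via HWCAP
instead), the dispatch can be as small as:

#include <stdint.h>

typedef void (*crush_hash4_fn)(uint32_t x, const uint32_t items[4],
                               uint32_t r, uint32_t out[4]);

/* the two variants; these names are placeholders for this sketch */
void crush_hash32_3_x4_sse2(uint32_t x, const uint32_t items[4],
                            uint32_t r, uint32_t out[4]);
void crush_hash32_3_x4_generic(uint32_t x, const uint32_t items[4],
                               uint32_t r, uint32_t out[4]);

/* choose an implementation once at startup */
crush_hash4_fn crush_pick_hash4(void)
{
#if defined(__x86_64__) || defined(__i386__)
        /* SSE2 is part of the x86_64 baseline, so there the branch is
         * always taken; the runtime check matters for 32-bit builds */
        if (__builtin_cpu_supports("sse2"))
                return crush_hash32_3_x4_sse2;
#endif
        return crush_hash32_3_x4_generic;
}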
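And for the exhaustive verification you suggest, something along these
lines as a first layer, at the hash level (again a sketch with
placeholder wiring; a second layer would build randomly sized and
weighted straw2 buckets and compare bucket_straw2_choose itself
between the two builds):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CRUSH_HASH_RJENKINS1 0  /* from crush/hash.h */

/* scalar reference (the existing entry point declared in crush/hash.h)
 * and the SIMD variant from the sketch above */
uint32_t crush_hash32_3(int type, uint32_t a, uint32_t b, uint32_t c);
void crush_hash32_3_x4(uint32_t x, const uint32_t items[4],
                       uint32_t r, uint32_t out[4]);

/* full-range 32-bit values: random() only yields 31 bits */
static uint32_t rand32(void)
{
        return (uint32_t)random() ^ ((uint32_t)random() << 16);
}

int main(void)
{
        srandom(0x5eed);  /* fixed seed keeps failures reproducible */
        for (long iter = 0; iter < 10000000; iter++) {
                uint32_t x = rand32();
                uint32_t r = rand32() % 10;  /* small r, as in a rule */
                uint32_t items[4], out[4];
                int i;

                for (i = 0; i < 4; i++)
                        items[i] = rand32();
                crush_hash32_3_x4(x, items, r, out);
                for (i = 0; i < 4; i++)
                        assert(out[i] ==
                               crush_hash32_3(CRUSH_HASH_RJENKINS1,
                                              x, items[i], r));
        }
        printf("SIMD/scalar hash agreement: OK\n");
        return 0;
}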
Cheers

> sage
>
>> Cheers
>>
>> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef

--
Loïc Dachary, Artisan Logiciel Libre