On Mon, 29 Aug 2016, Mark Nelson wrote:
> Anything we can do to help on the CPU usage front is a win IMHO, though
> I would be interested in seeing an example where we are spending a lot
> of time on crush in a real usage scenario?

The monitor's prime_pg_temp has to calculate a crush mapping for every
PG when the osdmap changes in significant ways.  A 2x improvement there
is a big help, since the work has to be timeboxed and aborted if it runs
too long.  The same goes for the crushtool test that runs whenever the
crush map changes.

sage

> Mark
>
> On 08/29/2016 06:42 AM, Loic Dachary wrote:
> > Hi,
> >
> > TL;DR: crush_do_rule using SIMD runs twice as fast, the
> > implementation is straightforward and would help with crushmap
> > validation; is there any reason not to do it?
> >
> > When resolving a crush rule (crush_do_rule in mapper.c), the straw2
> > function (bucket_straw2_choose) calls the hashing function
> > (crush_hash32_3) for each item in a bucket and keeps the best match.
> > When a bucket has four items, the hash function can be run using
> > SIMD instructions: each item value is 32 bits, so four of them fit
> > in a __m128i.
> >
> > I tried to inline the hash function when the conditions are
> > right [1] and ran a test to measure the difference:
> >
> >   crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
> >   time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
> >   rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
> >   rule 0 (replicated_ruleset) num_rep 4 result size == 4:  2048000/2048000
> >
> > With SIMD:
> >
> >   real 0m10.433s
> >   user 0m10.428s
> >   sys  0m0.000s
> >
> > Without SIMD:
> >
> >   real 0m19.344s
> >   user 0m19.340s
> >   sys  0m0.004s
> >
> > Callgrind's estimated cycles per crush_do_rule call confirm the
> > speedup:
> >
> >   rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
> >   kcachegrind crush.callgrind
> >
> > With SIMD:    crush_do_rule is estimated to use 21,205 cycles
> > Without SIMD: crush_do_rule is estimated to use 53,068 cycles
> >
> > This proof of concept relies on instructions that are available on
> > all ARM & Intel processors; nothing complicated is going on.  It
> > benefits crush maps that have more than four disks per host, more
> > than four hosts per rack, etc.  It is probably a small win for an
> > OSD or even a client.  For crushmap validation it helps
> > significantly, since the MONs are not able to run crushtool
> > asynchronously and validation needs to finish within a few seconds
> > (because it blocks the MON).
> >
> > The implementation is straightforward: it needs sub/xor/lshift/
> > rshift.  The only relatively tricky part is runtime / compile time
> > detection of the SIMD instructions for both Intel and ARM
> > processors.  Luckily this has already been taken care of when
> > integrating with the jerasure erasure code plugin.
> >
> > Is there any reason why it would not be good to implement this?
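
To make the lane-parallel idea concrete, here is a minimal sketch, not
the actual patch in [1]: it hashes four 32-bit bucket item ids at once
with SSE2 intrinsics.  The names hashmix4 and crush_hash32_3_x4 are
made up for illustration, and only one sub/xor/shift round is shown;
the real rounds, seeds and shift amounts live in crush/hash.c.

  #include <emmintrin.h>   /* SSE2 intrinsics; <arm_neon.h> has equivalents */
  #include <stdint.h>

  /* One Jenkins-style sub/xor/shift mixing round, applied to four
   * independent 32-bit lanes at once.  The scalar hash chains several
   * of these rounds with fixed shift amounts. */
  static inline void hashmix4(__m128i *a, __m128i *b, __m128i *c)
  {
          *a = _mm_sub_epi32(*a, *b);
          *a = _mm_sub_epi32(*a, *c);
          *a = _mm_xor_si128(*a, _mm_srli_epi32(*c, 13));
          *b = _mm_sub_epi32(*b, *c);
          *b = _mm_sub_epi32(*b, *a);
          *b = _mm_xor_si128(*b, _mm_slli_epi32(*a, 8));
          *c = _mm_sub_epi32(*c, *a);
          *c = _mm_sub_epi32(*c, *b);
          *c = _mm_xor_si128(*c, _mm_srli_epi32(*b, 13));
  }

  /* Hash four bucket item ids against the same (x, r) in one pass.
   * bucket_straw2_choose() would then turn each lane into a weighted
   * draw (via the crush_ln lookup) and keep the item with the best
   * one, exactly as the scalar loop does today. */
  static __m128i crush_hash32_3_x4(uint32_t x, const int32_t ids[4],
                                   uint32_t r)
  {
          __m128i a = _mm_set1_epi32(x);
          __m128i b = _mm_loadu_si128((const __m128i *)ids);
          __m128i c = _mm_set1_epi32(r);

          hashmix4(&a, &b, &c);   /* the real hash runs more rounds */
          return c;
  }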
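
The runtime / compile time detection Loic mentions typically boils down
to a dispatch like the sketch below.  This is the generic pattern, not
the jerasure plugin's actual probing code, and the function names are
hypothetical.  On x86 the compiler builtin does the cpuid probing; on
ARM one would check getauxval(AT_HWCAP) & HWCAP_NEON instead.

  #include <stdint.h>

  /* scalar and SSE2 variants, compiled in separate objects so only the
   * SIMD one is built with -msse2 */
  uint32_t crush_hash32_3_scalar(uint32_t a, uint32_t b, uint32_t c);
  uint32_t crush_hash32_3_sse2(uint32_t a, uint32_t b, uint32_t c);

  typedef uint32_t (*crush_hash_fn)(uint32_t, uint32_t, uint32_t);

  /* Pick the best implementation once, at startup. */
  static crush_hash_fn crush_pick_hash(void)
  {
  #if defined(__SSE2__)
          /* compile time: the whole binary targets SSE2 anyway */
          return crush_hash32_3_sse2;
  #elif defined(__i386__) || defined(__x86_64__)
          /* run time: GCC/clang probe cpuid for us */
          if (__builtin_cpu_supports("sse2"))
                  return crush_hash32_3_sse2;
          return crush_hash32_3_scalar;
  #else
          /* ARM NEON detection would go here, via the hwcaps */
          return crush_hash32_3_scalar;
  #endif
  }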
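
As for the mon-side constraint Sage describes above: the priming work
is bounded by wall-clock time, roughly along the lines of this hedged
sketch.  pg_count, budget_sec and map_one_pg are illustrative stand-ins,
not the real prime_pg_temp code; the point is only that a faster
crush_do_rule lets more PGs fit inside the same budget.

  #include <stdbool.h>
  #include <time.h>

  static double now_sec(void)
  {
          struct timespec ts;
          clock_gettime(CLOCK_MONOTONIC, &ts);
          return ts.tv_sec + ts.tv_nsec / 1e9;
  }

  /* Map PGs until the time budget runs out; report whether we got
   * through all of them or had to abort early. */
  static bool prime_all_pgs(unsigned pg_count, double budget_sec,
                            void (*map_one_pg)(unsigned))
  {
          double start = now_sec();
          for (unsigned pg = 0; pg < pg_count; pg++) {
                  map_one_pg(pg);                /* one crush_do_rule() */
                  if (now_sec() - start > budget_sec)
                          return false;          /* timeboxed: give up */
          }
          return true;
  }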
> >
> > Cheers
> >
> > [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef