Hi Sage,

On 29/08/2016 15:55, Sage Weil wrote:
> On Mon, 29 Aug 2016, Loic Dachary wrote:
>> Hi,
>>
>> TL;DR: crush_do_rule using SIMD runs twice as fast, the
>> implementation is straightforward and would help with crushmap
>> validation; is there any reason not to do it?
>>
>> When resolving a crush rule (crush_do_rule in mapper.c), the straw2
>> function (bucket_straw2_choose) calls the hashing function
>> (crush_hash32_3) for each item in a bucket and keeps the best match.
>> When a bucket has four items, the hash function can be run using
>> SIMD instructions: each item value is 32 bits, so four of them fit
>> in a __m128i.
>>
>> I tried to inline the hash function when the conditions are
>> right [1] and ran a test to measure the difference:
>>
>> crushtool -o /tmp/t.map --num_osds 1024 --build node straw2 8 datacenter straw2 4 root straw2 0
>> time crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 2048000 --num-rep 4
>> rule 0 (replicated_ruleset), x = 1..2048000, numrep = 4..4
>> rule 0 (replicated_ruleset) num_rep 4 result size == 4: 2048000/2048000
>>
>> With SIMD:
>>
>> real 0m10.433s
>> user 0m10.428s
>> sys 0m0.000s
>>
>> Without SIMD:
>>
>> real 0m19.344s
>> user 0m19.340s
>> sys 0m0.004s
>>
>> Callgrind's estimated cycle counts for crush_do_rule show the same
>> ratio:
>>
>> rm crush.callgrind ; valgrind --tool=callgrind --callgrind-out-file=crush.callgrind crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 0 --min-x 1 --max-x 204800 --num-rep 4
>> kcachegrind crush.callgrind
>>
>> With SIMD: crush_do_rule is estimated to use 21,205 cycles.
>> Without SIMD: crush_do_rule is estimated to use 53,068 cycles.
>>
>> This proof of concept relies on instructions that are available on
>> all ARM & Intel processors; nothing complicated is going on. It
>> benefits crush maps that have more than four disks per host, more
>> than four hosts per rack, etc. It is probably a small win for an OSD
>> or even a client. For crushmap validation it helps significantly,
>> since the MON cannot run crushtool asynchronously and validation
>> needs to complete within a few seconds (because it blocks the MON).
>>
>> The implementation is straightforward: it only needs
>> sub/xor/lshift/rshift. The only relatively tricky part is runtime /
>> compile-time detection of the SIMD instructions for both Intel and
>> ARM processors. Luckily this has already been taken care of when
>> integrating the jerasure erasure code plugin.
>>
>> Is there any reason why it would not be good to implement this?
>
> This is really cool! I agree that the straw2 O(n) calculation on
> each node is the place to apply this.
>
> To answer your question, the only real risk/problem I see is that we
> need to keep the optimized variant perfectly in sync with the
> non-optimized one, since the result has to be deterministic. The
> maintenance burden is small, I think, since for that reason the code
> behavior doesn't really change, but we do need to pretty
> exhaustively verify that the new implementation matches the old one.
> Perhaps a set of unit tests that compile both variants and feed them
> randomly sized and weighted straw2 buckets, pushing lots of values
> through?

Right: the implementation is likely to be simple but it needs serious
testing. I'll give it a try.
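For the record, here is roughly the shape of the four-lane hash I have
in mind. This is an untested sketch, not the code from [1]: the helper
names are mine, and the mix order, shift constants and seed are
transcribed from my reading of crush_hashmix() and
crush_hash32_rjenkins1_3() in src/crush/hash.c, which remains the
authoritative reference.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

#define CRUSH_HASH_SEED 1315423911u  /* crush_hash_seed in hash.c */

/* one mix line, right-shift flavour:
 * t1 = t1-t2; t1 = t1-t3; t1 ^= t3 >> n */
#define MIX4R(t1, t2, t3, n) do {                          \
        t1 = _mm_sub_epi32(t1, t2);                        \
        t1 = _mm_sub_epi32(t1, t3);                        \
        t1 = _mm_xor_si128(t1, _mm_srli_epi32(t3, n));     \
} while (0)

/* same, left-shift flavour */
#define MIX4L(t1, t2, t3, n) do {                          \
        t1 = _mm_sub_epi32(t1, t2);                        \
        t1 = _mm_sub_epi32(t1, t3);                        \
        t1 = _mm_xor_si128(t1, _mm_slli_epi32(t3, n));     \
} while (0)

/* four-lane crush_hashmix(): the scalar macro's sub/xor/shift
 * sequence applied to four 32-bit lanes at once */
#define crush_hashmix4(a, b, c) do {    \
        MIX4R(a, b, c, 13);             \
        MIX4L(b, c, a, 8);              \
        MIX4R(c, a, b, 13);             \
        MIX4R(a, b, c, 12);             \
        MIX4L(b, c, a, 16);             \
        MIX4R(c, a, b, 5);              \
        MIX4R(a, b, c, 3);              \
        MIX4L(b, c, a, 10);             \
        MIX4R(c, a, b, 15);             \
} while (0)

/* hash four bucket items against the same (x, r), mirroring the
 * bucket_straw2_choose call crush_hash32_3(hash, x, item, r); each
 * out[i] must equal crush_hash32_rjenkins1_3(x, items[i], r) */
static void crush_hash32_3_x4(uint32_t x, const uint32_t items[4],
                              uint32_t r, uint32_t out[4])
{
        __m128i a  = _mm_set1_epi32((int)x);
        __m128i b  = _mm_loadu_si128((const __m128i *)items);
        __m128i c  = _mm_set1_epi32((int)r);
        __m128i vx = _mm_set1_epi32(231232);  /* x in rjenkins1_3 */
        __m128i vy = _mm_set1_epi32(1232);    /* y in rjenkins1_3 */
        /* hash = seed ^ a ^ b ^ c, with b varying per lane */
        __m128i hash = _mm_xor_si128(
                _mm_set1_epi32((int)(CRUSH_HASH_SEED ^ x ^ r)), b);

        crush_hashmix4(a, b, hash);
        crush_hashmix4(c, vx, hash);
        crush_hashmix4(vy, a, hash);
        crush_hashmix4(b, vx, hash);
        crush_hashmix4(vy, c, hash);

        _mm_storeu_si128((__m128i *)out, hash);
}

The nice property is that every scalar operation has a one-to-one
_mm_* counterpart, so the SIMD variant can be reviewed against the
scalar macro line by line, which should help keep the maintenance
burden you mention low.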
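For the instruction detection I expect to reuse the same approach as
the jerasure plugin glue. As a sketch of the x86 side only (the
function names below are placeholders, __builtin_cpu_supports is a
GCC/Clang builtin, and the ARM build would key off NEON via HWCAP
instead), the dispatch can be as small as:

#include <stdint.h>

typedef void (*crush_hash4_fn)(uint32_t x, const uint32_t items[4],
                               uint32_t r, uint32_t out[4]);

/* the two variants; these names are placeholders for this sketch */
void crush_hash32_3_x4_sse2(uint32_t x, const uint32_t items[4],
                            uint32_t r, uint32_t out[4]);
void crush_hash32_3_x4_generic(uint32_t x, const uint32_t items[4],
                               uint32_t r, uint32_t out[4]);

/* choose an implementation once at startup */
crush_hash4_fn crush_pick_hash4(void)
{
#if defined(__x86_64__) || defined(__i386__)
        /* SSE2 is part of the x86_64 baseline, so there the branch is
         * always taken; the runtime check matters for 32-bit builds */
        if (__builtin_cpu_supports("sse2"))
                return crush_hash32_3_x4_sse2;
#endif
        return crush_hash32_3_x4_generic;
}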
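And for the exhaustive verification you suggest, something along these
lines as a first layer, at the hash level (again a sketch with
placeholder wiring; a second layer would build randomly sized and
weighted straw2 buckets and compare bucket_straw2_choose itself
between the two builds):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define CRUSH_HASH_RJENKINS1 0  /* from crush/hash.h */

/* scalar reference (the existing entry point declared in crush/hash.h)
 * and the SIMD variant from the sketch above */
uint32_t crush_hash32_3(int type, uint32_t a, uint32_t b, uint32_t c);
void crush_hash32_3_x4(uint32_t x, const uint32_t items[4],
                       uint32_t r, uint32_t out[4]);

/* full-range 32-bit values: random() only yields 31 bits */
static uint32_t rand32(void)
{
        return (uint32_t)random() ^ ((uint32_t)random() << 16);
}

int main(void)
{
        srandom(0x5eed);  /* fixed seed keeps failures reproducible */
        for (long iter = 0; iter < 10000000; iter++) {
                uint32_t x = rand32();
                uint32_t r = rand32() % 10;  /* small r, as in a rule */
                uint32_t items[4], out[4];
                int i;

                for (i = 0; i < 4; i++)
                        items[i] = rand32();
                crush_hash32_3_x4(x, items, r, out);
                for (i = 0; i < 4; i++)
                        assert(out[i] ==
                               crush_hash32_3(CRUSH_HASH_RJENKINS1,
                                              x, items[i], r));
        }
        printf("SIMD/scalar hash agreement: OK\n");
        return 0;
}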
Cheers

> sage
>
>> Cheers
>>
>> [1] https://github.com/dachary/ceph/commit/71ae4584d9ed57f70aad718d0ffe206a01e91fef

--
Loïc Dachary, Artisan Logiciel Libre