And the commit with the benchmark test: https://lab.fedeproxy.eu/ceph/ceph/-/commit/b8ab6380adfc028da8166704dbc1755260226375

On 05/04/2021 10:48, Loïc Dachary wrote:
> The version with sharding to avoid cacheline contention is indeed faster (about 5 times faster). I modified the test program to verify that it is consistently at least 2x faster. This is deliberately conservative: the goal is to guard against a regression that would break the optimization entirely, rather than to fine-tune the optimization. There was such a regression in Ceph for a long time (fixed earlier this year) and it would be good if it did not happen again.
>
> In addition, the test should also verify that the optimization actually relates to cacheline contention. If I understand correctly, the latest output I sent you shows that the non-optimized version uses only one variable and there is cacheline contention reported by perf c2c. The optimized version, however, has no cacheline contention at all, which is the intended effect of the optimization.
>
> Is my reasoning correct so far?
>
> On 05/04/2021 09:27, Loïc Dachary wrote:
>> Morning Joe,
>>
>> On 05/04/2021 02:15, Joe Mario wrote:
>>> Hi Loïc:
>>>
>>> On Sun, Apr 4, 2021 at 4:14 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>> <snip>
>>> > Is the above assumption correct?
>>> Yes, absolutely right.
>>>
>>> > I changed the variable to be 128 bytes aligned[0], is it ok? Maybe there is a constant somewhere that provides this number (the number of bytes to be "cache aligned") so it is not hard coded?
>>>
>>> Here are a few ways you can get the cacheline size.
>>> One is by reading /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
>>> Another is with: gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
>>> Another is with: grep -m1 cache_alignment /proc/cpuinfo
>>>
>>> Most often it's 64 bytes. I believe the Power cpus are 128 bytes. Itanium was 128 bytes.
>>>
>>> However, even on the X86 platforms where the cacheline size is 64 bytes, it's very often a good idea to pad your hot locks or hot data items out to 128 bytes (e.g. 2 cachelines instead of 1).
>>> The reason is this: by default, when Intel processors fetch a cacheline of data, the cpu will gratuitously fetch the next cacheline, just in case you need it. However, if that next cacheline is a different hot cacheline, the last thing you need is to invalidate it with gratuitous writes.
>>>
>>> We have seen performance problems due to this, and the resolution was to pad the hot locks and variables out to 128 bytes. Some of the big database vendors pad out to 128 bytes because of this as well.
>> Thanks for explaining: it makes sense now.
>>> I looked at the 2nd tar.gz file that you uploaded (ceph-c2c-jmario-2021-04-04-22-13.tar.gz).
>>> As expected, the "without-sharding" case looked like it did earlier.
>>> However, in the "with-sharding" case, it didn't look like your ceph_test_c2c program was even running. I even dumped the raw samples from the perf.data file and didn't see any loads or stores from the program. Can you double check that it ran correctly?
>> It did not run, indeed. The version with sharding is faster and it finished before the measurements started. The first observable evidence of the optimization, exciting :-) I changed the test program so that it keeps running forever and will be killed by the caller when it is no longer needed.
>>
>> The output was uploaded in ceph-c2c-jmario-2021-04-05-09-26.tar.gz
>>
>> Cheers

--
Loïc Dachary, Artisan Logiciel Libre

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
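To make Joe's padding advice concrete, here is a minimal standalone sketch. It is not Ceph code: the names cacheline_size and padded_counter are invented for illustration. It queries the L1 data cache line size (the same value `getconf LEVEL1_DCACHE_LINESIZE` reports) and pads a hot counter out to 128 bytes, i.e. two 64-byte lines, so the adjacent-line prefetcher cannot drag a neighbouring hot variable into the contended line.

// Hypothetical illustration (not Ceph code): query the L1 dcache line size
// and pad a hot counter to 128 bytes (two 64-byte lines).
#include <atomic>
#include <cstdint>
#include <iostream>
#include <unistd.h>

// Same value as `getconf LEVEL1_DCACHE_LINESIZE` or
// /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.
static long cacheline_size() {
  long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  return sz > 0 ? sz : 64;  // fall back to the common 64-byte case
}

// Padded out to two cachelines so a gratuitous prefetch of the neighbouring
// line never pulls in another hot variable.
struct alignas(128) padded_counter {
  std::atomic<uint64_t> value{0};
};

int main() {
  static_assert(sizeof(padded_counter) == 128, "counter spans two 64-byte lines");
  std::cout << "L1 dcache line size: " << cacheline_size() << " bytes\n";
  std::cout << "sizeof(padded_counter): " << sizeof(padded_counter) << "\n";
  return 0;
}

On most x86 machines this pads to two cachelines; on Power or Itanium, where the line size is already 128 bytes, the same struct occupies exactly one line.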
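The test program itself is in the commit linked at the top of the message; what follows is only a rough sketch of the shape being described (a counter loop that runs forever until the caller kills it, with a sharded and an unsharded mode), under assumed names such as SHARDS, shard and g_single rather than the real ceph_test_c2c code.

// Rough sketch of a sharded-vs-unsharded counter stress loop, in the spirit
// of the test described above (assumed names, not the real ceph_test_c2c).
// Run it under `perf c2c record` and kill it when enough samples are taken.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

constexpr unsigned SHARDS = 16;

// Each shard sits on its own pair of cachelines (128 bytes) so writers on
// different cpus never touch the same line.
struct alignas(128) shard {
  std::atomic<uint64_t> count{0};
};

shard g_sharded[SHARDS];            // "with-sharding": no contention
std::atomic<uint64_t> g_single{0};  // "without-sharding": one hot cacheline

int main(int argc, char** argv) {
  const bool sharded = (argc > 1 && std::strcmp(argv[1], "--sharded") == 0);
  const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());

  std::vector<std::thread> workers;
  for (unsigned i = 0; i < nthreads; i++) {
    workers.emplace_back([i, sharded] {
      // Loop forever: the caller is expected to kill the process once
      // perf has collected enough samples.
      for (;;) {
        if (sharded)
          g_sharded[i % SHARDS].count.fetch_add(1, std::memory_order_relaxed);
        else
          g_single.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& w : workers)
    w.join();  // never reached; the process runs until killed
  return 0;
}

Under `perf c2c record`, the unsharded mode should show the single counter's cacheline as a contended line, while the sharded mode should show none, which is the property the thread wants the test to assert alongside the conservative "at least 2x faster" wall-clock check.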