And the commit with the benchmark test: https://lab.fedeproxy.eu/ceph/ceph/-/commit/b8ab6380adfc028da8166704dbc1755260226375

On 05/04/2021 10:48, Loïc Dachary wrote:
> The version with sharding to avoid cacheline contention is indeed faster (about 5 times faster). I modified the test program to verify that it is consistently at least 2x faster. This is deliberately conservative: the goal is to guard against a regression that would break the optimization entirely, rather than to fine-tune the optimization. There was such a regression in Ceph for a long time (fixed earlier this year) and it would be good if it did not happen again.
>
> In addition, the test should also verify that the optimization actually relates to cacheline contention. If I understand correctly, the latest output I sent you shows that the non-optimized version uses only one variable and there is cacheline contention reported by perf c2c. The optimized version, however, has no cacheline contention at all, which is the intended effect of the optimization.
>
> Is my reasoning correct so far?
>
> On 05/04/2021 09:27, Loïc Dachary wrote:
>> Morning Joe,
>>
>> On 05/04/2021 02:15, Joe Mario wrote:
>>> Hi Loïc:
>>>
>>> On Sun, Apr 4, 2021 at 4:14 PM Loïc Dachary <loic@xxxxxxxxxxx> wrote:
>>>
>>> <snip>
>>> > Is the above assumption correct?
>>> Yes, absolutely right.
>>>
>>> > I changed the variable to be 128 bytes aligned[0], is it ok? Maybe there is a constant somewhere that provides this number (the number of bytes to be "cache aligned") so it is not hard coded?
>>>
>>> Here are a few ways you can get the cacheline size.
>>> One is by reading /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
>>> Another is with: gcc -DLEVEL1_DCACHE_LINESIZE=`getconf LEVEL1_DCACHE_LINESIZE` ...
>>> Another is with: grep -m1 cache_alignment /proc/cpuinfo
>>>
>>> Most often it's 64 bytes. I believe the Power cpus are 128 bytes. Itanium was 128 bytes.
>>>
>>> However, even on the X86 platforms where the cacheline size is 64 bytes, it's very often a good idea to pad your hot locks or hot data items out to 128 bytes (e.g. 2 cachelines instead of 1).
>>> The reason is this: by default, when Intel processors fetch a cacheline of data, the cpu will gratuitously fetch the next cacheline, just in case you need it. However, if that next cacheline is a different hot cacheline, the last thing you need is to invalidate it with gratuitous writes.
>>>
>>> We have seen performance problems due to this, and the resolution was to pad the hot locks and variables out to 128 bytes. Some of the big database vendors pad out to 128 bytes because of this as well.
>> Thanks for explaining: it makes sense now.
>>> I looked at the 2nd tar.gz file that you uploaded (ceph-c2c-jmario-2021-04-04-22-13.tar.gz).
>>> As expected, the "without-sharding" case looked like it did earlier.
>>> However, in the "with-sharding" case, it didn't look like your ceph_test_c2c program was even running. I even dumped the raw samples from the perf.data file and didn't see any loads or stores from the program. Can you double check that it ran correctly?
>> It did not run, indeed. The version with sharding is faster and it finished before the measurements started. The first observable evidence of the optimization, exciting :-) I changed the test program so that it keeps running forever and will be killed by the caller when it is no longer needed.
>>
>> The output was uploaded in ceph-c2c-jmario-2021-04-05-09-26.tar.gz
>>
>> Cheers

--
Loïc Dachary, Artisan Logiciel Libre

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
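To make Joe's padding advice concrete, here is a minimal standalone sketch. It is not Ceph code: the names cacheline_size and padded_counter are invented for illustration. It queries the L1 data cache line size (the same value `getconf LEVEL1_DCACHE_LINESIZE` reports) and pads a hot counter out to 128 bytes, i.e. two 64-byte lines, so the adjacent-line prefetcher cannot drag a neighbouring hot variable into the contended line.

// Hypothetical illustration (not Ceph code): query the L1 dcache line size
// and pad a hot counter to 128 bytes (two 64-byte lines).
#include <atomic>
#include <cstdint>
#include <iostream>
#include <unistd.h>

// Same value as `getconf LEVEL1_DCACHE_LINESIZE` or
// /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.
static long cacheline_size() {
  long sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  return sz > 0 ? sz : 64;  // fall back to the common 64-byte case
}

// Padded out to two cachelines so a gratuitous prefetch of the neighbouring
// line never pulls in another hot variable.
struct alignas(128) padded_counter {
  std::atomic<uint64_t> value{0};
};

int main() {
  static_assert(sizeof(padded_counter) == 128, "counter spans two 64-byte lines");
  std::cout << "L1 dcache line size: " << cacheline_size() << " bytes\n";
  std::cout << "sizeof(padded_counter): " << sizeof(padded_counter) << "\n";
  return 0;
}

On most x86 machines this pads to two cachelines; on Power or Itanium, where the line size is already 128 bytes, the same struct occupies exactly one line.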
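The test program itself is in the commit linked at the top of the message; what follows is only a rough sketch of the shape being described (a counter loop that runs forever until the caller kills it, with a sharded and an unsharded mode), under assumed names such as SHARDS, shard and g_single rather than the real ceph_test_c2c code.

// Rough sketch of a sharded-vs-unsharded counter stress loop, in the spirit
// of the test described above (assumed names, not the real ceph_test_c2c).
// Run it under `perf c2c record` and kill it when enough samples are taken.
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

constexpr unsigned SHARDS = 16;

// Each shard sits on its own pair of cachelines (128 bytes) so writers on
// different cpus never touch the same line.
struct alignas(128) shard {
  std::atomic<uint64_t> count{0};
};

shard g_sharded[SHARDS];            // "with-sharding": no contention
std::atomic<uint64_t> g_single{0};  // "without-sharding": one hot cacheline

int main(int argc, char** argv) {
  const bool sharded = (argc > 1 && std::strcmp(argv[1], "--sharded") == 0);
  const unsigned nthreads = std::max(2u, std::thread::hardware_concurrency());

  std::vector<std::thread> workers;
  for (unsigned i = 0; i < nthreads; i++) {
    workers.emplace_back([i, sharded] {
      // Loop forever: the caller is expected to kill the process once
      // perf has collected enough samples.
      for (;;) {
        if (sharded)
          g_sharded[i % SHARDS].count.fetch_add(1, std::memory_order_relaxed);
        else
          g_single.fetch_add(1, std::memory_order_relaxed);
      }
    });
  }
  for (auto& w : workers)
    w.join();  // never reached; the process runs until killed
  return 0;
}

Under `perf c2c record`, the unsharded mode should show the single counter's cacheline as a contended line, while the sharded mode should show none, which is the property the thread wants the test to assert alongside the conservative "at least 2x faster" wall-clock check.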