Re: mempool and cacheline ping pong

Hi Joe,

I looked into perf script[0], trying to extract information related to the c2c test, in an attempt to make your "grep ceph_test_c2c t.script  |grep ldlat |sed -e 's/^.*LCK//' |awk '{s+=$2;c+=1}END {print s " " c " " s/c}'" line more flexible. The first problem was to figure out which fields are displayed by default: the documentation for --fields does not say, and there does not seem to be a way to print the field labels. I took a look at the sources[1]: the beginning of each line (command name, pid/tid, cpu) is easy to guess, but it becomes blurry towards the end of a line like:

ceph_test_c2c 184970 [000] 617917.424796:      37057         cpu/mem-stores/P:     7ffcf821fef4         5080144 L1 hit|SNP N/A|TLB N/A|LCK N/A                               0     55629c5f8708 [unknown] (/home/loic/ceph/build/bin/ceph_test_c2c)       5b4f81ef4

It looked like I could lose myself in this and did not investigate further.
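
For the record, the direction I had in mind was something along these lines. It is an untested sketch: it keeps your trick of anchoring on the "LCK" token to find the load latency, and additionally assumes that the data address is the field right after the "cpu/mem-loads,ldlat=.../P:" event name (as in the default layout above), so the latencies can be grouped per 64-byte cacheline instead of being summed globally:

  grep ceph_test_c2c t.script | grep ldlat | gawk '
  {
      # data address: assumed to be the field following the event name
      # (the one containing "ldlat"), as in the default layout above
      addr = ""
      for (i = 1; i <= NF; i++)
          if ($i ~ /ldlat/) { addr = $(i + 1); break }
      if (addr == "") next

      # load latency: same as the one-liner, i.e. the second field
      # once everything up to "LCK" has been stripped
      line = $0
      sub(/^.*LCK/, "", line)
      split(line, f)
      lat = f[2]

      # align the data address down to its 64-byte cacheline
      a = strtonum("0x" addr)
      cl = a - (a % 64)
      sum[cl] += lat
      cnt[cl] += 1
  }
  END {
      for (cl in sum)
          printf "cacheline 0x%x: samples %d, total latency %d, avg %.1f\n",
                 cl, cnt[cl], sum[cl], sum[cl] / cnt[cl]
  }'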

I did not follow up on your suggestion to run perf on a running Ceph cluster either (sorry to disappoint :-( ) because I'd like to complete this simple task first; undertaking anything more ambitious at this point would be difficult.

However! As you suggested, I modified the test[2] with your new shell script and ran it on a real server (128 cores, 700+ GB of RAM). You will find the results in dropbox.redhat.com under the name ceph-c2c-jmario-2021-04-24-20-13.tar.gz. Hopefully they are consistent: I can't judge for myself because the content of the perf report is still quite mysterious to me. The metric I set is satisfied (the optimized version is at least twice as fast), and that is good enough to propose this for integration into the Ceph test suite. The test will fail if the optimization is accidentally disabled by a regression.
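
To make the "at least twice as fast" criterion concrete, the kind of check involved boils down to comparing the summed load latencies of the two runs, along the lines of the sketch below. This is an illustration, not the actual c2c.sh[2]: extract_load_latency stands for a hypothetical helper wrapping a pipeline like the one above, and the perf data file names are made up.

  # illustration only: fail unless the non-sharded run shows at least
  # twice the total load latency of the sharded (optimized) run
  without=$(extract_load_latency perf_c2c_without_sharding.data)   # hypothetical helper
  with=$(extract_load_latency perf_c2c_with_sharding.data)         # hypothetical helper
  if test "$without" -lt "$((with * 2))" ; then
      echo "regression: sharding no longer halves the mempool cacheline contention"
      exit 1
  fi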

My next step is to submit the test for inclusion in Ceph, once the validation completes successfully[3].
Cheers

[0] https://linux.die.net/man/1/perf-script
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/builtin-script.c?h=v5.12-rc8#n734
[2] https://lab.fedeproxy.eu/ceph/ceph/-/blob/864e22a77ccac23454407573c125a17c55138eb7/qa/standalone/c2c/c2c.sh#L5
[3] http://pulpito.front.sepia.ceph.com/dachary-2021-04-24_18:20:26-rados:standalone:workloads:c2c.yaml-wip-mempool-cacheline-49781-distro-basic-smithi/

On 05/04/2021 19:38, Joe Mario wrote:
>
>
> On Mon, Apr 5, 2021 at 10:38 AM Loïc Dachary <loic@xxxxxxxxxxx <mailto:loic@xxxxxxxxxxx>> wrote:
> [snip]
>
>     > The cpu doing that atomic instruction needs to first get ownership of the cacheline.  If no other threads of execution are also trying to get ownership of that cacheline, then ownership is granted rather quickly. 
>     > If, however, there are many other threads of execution trying to get ownership of that cacheline, then all those cpus must "get in line" and wait their turn.
>     >
>     And if there are N CPUs and N*2 threads, the only way to avoid cacheline contention is if all of these threads use a different (cacheline-aligned)
>     variable. Is that correct? If it is correct, I wonder if cacheline contention has a linear impact on performance. If there is cacheline contention on 1 variable with N threads on N CPUs, will the performance degradation be the same if there is cacheline contention on 10 variables and the same number of threads and CPUs? Or will it get worse because there is some sort of amplification?
>
>
> In your test case, you avoided all cacheline contention by putting every lock into its own cacheline.  In practice, however, multiple threads need to be contending for the same locks to modify shared data.
>
> The goal is to look at your application's hottest contended cachelines to see if that contention can be minimized.  Some of the ways to do that include:
> 1) Seeing if the number of accesses to that line can be minimized, especially the writers.
> 2) Making sure multiple hot data variables don't share the same cacheline.
> 3) Looking to see if the accesses to the hot cachelines are coming from the same NUMA node as where the hot data lives.  This isn't always possible, but it's good to examine it.
>
>  
>
>     I'd like the test script to extract those numbers from the files produced by perf c2c. How do you suggest I go about this?
>
>
> Doing it for this test case was somewhat trivial, albeit fragile.  That's because the "without-sharding" version only had one contended cacheline, and the "with-sharding" version had no contended cacheline.
> Because of that, I was able to dump the raw data and add up the load latencies for the test program's load instructions.
> The steps I used were:
>   # perf script -i perf_c2c_a_all_user.data > t.script -f
>   # grep ceph_test_c2c t.script  |grep ldlat |sed -e 's/^.*LCK//' |awk '{s+=$2;c+=1}END {print s " " c " " s/c}'
>
> However the above simple script won't work when there's more than one hot cacheline involved.  You can still find the data you want, it just gets more complicated.
> Plus, the real value in perf c2c is not just seeing the load latencies, but rather learning everything about the contention, which will then help guide how to minimize it.  It provides a lot of insight into what's happening.
>
> How does this approach sound?
>  1) You set up Ceph to run on a bigger multi-node server with fast storage.
>  2) Run the attached script, which is just an updated version of the script you've been running.
>  3) Then we can set up a shared video call where I can walk you through the perf c2c output pointing out all the key pieces of information.
>  4) With the insight from "3" above, you can then decide what you might want to automate and how it might be done.
>
> Does that sound reasonable?
>
> See the attached "run_c2c_ceph2.sh" script.
> Joe
>
>

-- 
Loïc Dachary, Artisan Logiciel Libre


