Hi Joe,

The test program[0] was run with the commands you provided[1] and a 30-second "warm up" period. The output was uploaded to dropbox.redhat.com in a file named ceph-c2c-jmario-2021-04-04-16-28.tar.gz. There are two directories:

* with-sharding is with the optimization turned on
* without-sharding is without the optimization

I took a quick look and saw a difference in the perf_c2c_a_all_user_phys_data.out files. Without the optimization there is only one line, and with the optimization there are 8 (the number of threads). The interpretation of this difference is beyond me and I'm very curious to read what you make of it, as well as the rest of the output :-)

A machine is dedicated to this test and is doing nothing else. Now that it's ready, I'll be able to apply whatever adjustments you suggest the same day.

Thanks again for your patience and willingness to share your expertise!

Cheers

[0] https://lab.fedeproxy.eu/ceph/ceph/-/blob/wip-mempool-cacheline-49781/src/test/test_c2c.cc
[1] https://lab.fedeproxy.eu/ceph/ceph/-/blob/wip-mempool-cacheline-49781/qa/standalone/c2c/c2c.sh
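In case it helps anyone following the thread without opening the links: the pattern the test exercises is roughly the following. It is a simplified sketch, not the actual test_c2c.cc or mempool.h code; the shard count, pick_a_shard() and the counter names are made up for illustration.

  #include <atomic>
  #include <cstddef>
  #include <functional>
  #include <thread>
  #include <vector>

  // Without the optimization: every thread increments the same atomic, so the
  // cacheline holding it keeps bouncing between cores ("ping pong").
  std::atomic<int> shared_counter{0};

  // With the optimization: each thread picks a shard derived from its thread
  // id, so most increments hit a counter the other threads are not touching.
  constexpr std::size_t num_shards = 8;
  std::atomic<int> sharded_counters[num_shards];

  std::size_t pick_a_shard() {
    // Hypothetical shard selection: hash the thread id onto a shard index.
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % num_shards;
  }

  void worker(bool sharded, std::size_t iterations) {
    std::atomic<int>& counter =
        sharded ? sharded_counters[pick_a_shard()] : shared_counter;
    for (std::size_t i = 0; i < iterations; ++i)
      counter.fetch_add(1, std::memory_order_relaxed);
  }

  int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
      threads.emplace_back(worker, /*sharded=*/true, 100000000u);
    for (auto& t : threads)
      t.join();
  }

If I understand the intent correctly, the first variant should show up in perf c2c as a single hot offset and the sharded variant as several, which is my naive reading of the one line versus eight lines mentioned above; I'll wait for your interpretation before drawing conclusions.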
Here are the relevant excerpts (without sharding first, then with sharding):

================================================
#
#        ----------- Cacheline ----------    Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0      0x7fffff04c200     0       1  100.00%    35593    35593        0   140432    60712    79720    79720        0        0    25114        0         5    35593         0        0         0         0

=================================================
      Shared Cache Line Distribution Pareto
=================================================
#
#        ----- HITM -----  -- Store Refs --  --------- Data address ---------                      ---------- cycles ----------    Total       cpu                                  Shared
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss              Offset  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol         Object        Source:Line  Node
# .....  .......  .......  .......  .......  ..................  ....  ......  ..................  ........  ........  ........  .......  ........  ..............................  .............  .................  ....
#
  -------------------------------------------------------------
      0        0    35593    79720        0      0x7fffff04c200
  -------------------------------------------------------------
           0.00%  100.00%  100.00%    0.00%                0x30     0       1      0x55669fda67ba         0       550       450   140432         4  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0

And with sharding:

=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ----------    Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0      0x7ffecdcb4700     0   71586  100.00%    22548    22548        0    84925    28141    56784    56784        0        0     5592        0         1    22548         0        0         0         0

=================================================
      Shared Cache Line Distribution Pareto
=================================================
#
#        ----- HITM -----  -- Store Refs --  --------- Data address ---------                      ---------- cycles ----------    Total       cpu                                  Shared
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss              Offset  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol         Object        Source:Line  Node
# .....  .......  .......  .......  .......  ..................  ....  ......  ..................  ........  ........  ........  .......  ........  ..............................  .............  .................  ....
#
  -------------------------------------------------------------
      0        0    22548    56784        0      0x7ffecdcb4700
  -------------------------------------------------------------
           0.00%   27.38%   15.11%    0.00%                 0x0     0       1      0x55cc50aa67ba         0       274       289    16378         2  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%    0.00%   14.67%    0.00%                 0x4     0       1      0x55cc50aa67ba         0         0         0     8331         1  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%   26.40%   15.26%    0.00%                 0x8     0       1      0x55cc50aa67ba         0       279       289    16572         1  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%   13.32%   14.72%    0.00%                 0xc     0       1      0x55cc50aa67ba         0       548       406    11974         2  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%   10.36%   12.93%    0.00%                0x10     0       1      0x55cc50aa67ba         0       423       299    10135         1  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%    9.82%   12.42%    0.00%                0x14     0       1      0x55cc50aa67ba         0       407       288     9638         1  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
           0.00%   12.72%   14.89%    0.00%                0x18     0       1      0x55cc50aa67ba         0       546       410    11897         1  [.] std::__atomic_base<int>::o   ceph_test_c2c  atomic_base.h:548     0
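A note to self while staring at the second Pareto, with the caveat that I may be misreading it: the hot offsets are 4 bytes apart (0x0 through 0x18) inside one cacheline. If the intent were to give each shard its own line, my understanding is that the usual C++ way to enforce it is to align each shard to the line size, something like this generic sketch (not the mempool.h code; the member names are made up):

  #include <atomic>
  #include <cstddef>

  // Generic sketch (not the Ceph mempool code): force each shard onto its own
  // cacheline so threads updating different shards never steal each other's
  // line. 64 bytes is assumed here; std::hardware_destructive_interference_size
  // from <new> can be used instead where the toolchain provides it.
  constexpr std::size_t cacheline_size = 64;

  struct alignas(cacheline_size) shard {
    std::atomic<int> items{0};
    std::atomic<int> bytes{0};
    // alignas() rounds sizeof(shard) up to 64, so consecutive array elements
    // start on different cachelines instead of being packed 4 bytes apart.
  };
  static_assert(sizeof(shard) == cacheline_size, "one shard per cacheline");

  shard shards[8];  // shards[i] and shards[j] never share a cacheline for i != j

Each shard then occupies a full line at the cost of some padding. I'm not claiming this is what the numbers above call for, just writing it down so I can compare with your reading.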
On 18/03/2021 18:25, Loïc Dachary wrote:
> Hi Joe,
>
> I can't tell you how happy I am that you're around to help understand this :-) In the spirit of taking baby steps to better understand what's going on, I'd like to run a small part of Ceph[0] that is designed to be optimized to avoid cacheline ping-pong. The code exercising this part would live in a test (similar to an existing one[1]). It would be launched on a single Intel machine (as part of a teuthology run, Ceph's integration test framework) and use the commands you suggest. After the first run I'll send the data to you for interpretation (it sounds like consulting an oracle :-) ).
>
> With your help I'm hoping the integration test will assert that running the mempool with the optimization is at least X% faster than without it, and fail otherwise. That would be very helpful to guard against accidental regressions.
>
> Once this first goal is achieved, collecting data from a Ceph cluster running under load could follow the same methodology. I'm sure Mark Nelson will be most interested in this more ambitious target.
>
> If I'm not mistaken, the commands (let's say they are in a ceph-c2c.sh script) should be run like this:
>
> * Run the software (be it the mempool simulation or Ceph under load) and let it warm up
> * Run ceph-c2c.sh (it won't take more than a minute or so to complete)
> * Collect the data and save it
> * Kill the software
>
> Is that correct?
>
> Cheers
>
> [0] https://github.com/ceph/ceph/blob/2b21735498c98299d5ce383011c3dbe25aaee70f/src/include/mempool.h#L195-L211
> [1] https://github.com/ceph/ceph/blob/2b21735498c98299d5ce383011c3dbe25aaee70f/src/test/test_mempool.cc#L405-L431
>
> On 18/03/2021 16:40, jmario@xxxxxxxxxx wrote:
>> Hi Loïc,
>> Per our email discussion, I'm happy to help. If you or anyone else can run perf c2c, I will analyze the results and reply back with the findings.
>>
>> The perf c2c output is a bit non-intuitive, but it conveys a lot. I'm happy to share the findings.
>>
>> Here's what I recommend:
>> 1) Get on an Intel system where you're pushing Ceph really hard. (AMD uses different low-level perf events that haven't been ported over yet.)
>> 2) Make sure the Ceph code you're running has debug info in it and isn't stripped.
>> 3) This needs to run on bare metal. The PEBS perf events used by c2c are not supported in a virtualized guest (Intel says support is coming in newer CPUs).
>> 4) As an FYI, the less CPU pinning you do, the more cacheline contention c2c will expose.
>> 5) Once you have run the commands appended below (as root), tar up everything, data files and all, and lftp them to the location below:
>>
>> $ lftp dropbox.redhat.com
>> > cd /incoming
>> > put unique-filename
>>
>> Please let me know the names of the files once you have uploaded them. I'll grab them.
>> I just joined this list and I don't know if I'll get notified of the replies, so send me an email when the files are there for me to grab.
>>
>> Does that sound OK?
>> Holler if you have any questions.
>> Joe
>>
>> # First get some background system info
>> uname -a > uname.out
>> lscpu > lscpu.out
>> cat /proc/cmdline > cmdline.out
>> timeout -s INT 10 vmstat -w 1 > vmstat.out
>>
>> nodecnt=`lscpu | grep "NUMA node(" | awk '{print $3}'`
>> for ((i=0; i<$nodecnt; i++))
>> do
>>   cat /sys/devices/system/node/node${i}/meminfo > meminfo.$i.out
>> done
>> more `find /proc -name status` > proc_parent_child_status.out
>> more /proc/*/numa_maps > numa_maps.out
>>
>> #
>> # Get separate kernel and user perf-c2c stats
>> #
>> perf c2c record -a --ldlat=70 --all-user -o perf_c2c_a_all_user.data sleep 5
>> perf c2c report --stdio -i perf_c2c_a_all_user.data > perf_c2c_a_all_user.out 2>&1
>> perf c2c report --full-symbols --stdio -i perf_c2c_a_all_user.data > perf_c2c_full-sym_a_all_user.out 2>&1
>>
>> perf c2c record -g -a --ldlat=70 --all-user -o perf_c2c_g_a_all_user.data sleep 5
>> perf c2c report -g --stdio -i perf_c2c_g_a_all_user.data > perf_c2c_g_a_all_user.out 2>&1
>>
>> perf c2c record -a --ldlat=70 --all-kernel -o perf_c2c_a_all_kernel.data sleep 4
>> perf c2c report --stdio -i perf_c2c_a_all_kernel.data > perf_c2c_a_all_kernel.out 2>&1
>>
>> perf c2c record -g --ldlat=70 -a --all-kernel -o perf_c2c_g_a_all_kernel.data sleep 4
>> perf c2c report -g --stdio -i perf_c2c_g_a_all_kernel.data > perf_c2c_g_a_all_kernel.out 2>&1
>>
>> #
>> # Get combined kernel and user perf-c2c stats
>> #
>> perf c2c record -a --ldlat=70 -o perf_c2c_a_both.data sleep 4
>> perf c2c report --stdio -i perf_c2c_a_both.data > perf_c2c_a_both.out 2>&1
>>
>> perf c2c record -g --ldlat=70 -a --all-kernel -o perf_c2c_g_a_both.data sleep 4
>> perf c2c report -g --stdio -i perf_c2c_g_a_both.data > perf_c2c_g_a_both.out 2>&1
>>
>> #
>> # Get all-user physical addr stats, in case multiple threads or processes are
>> # accessing shared memory with different vaddrs.
>> #
>> perf c2c record --phys-data -a --ldlat=70 --all-user -o perf_c2c_a_all_user_phys_data.data sleep 5
>> perf c2c report --stdio -i perf_c2c_a_all_user_phys_data.data > perf_c2c_a_all_user_phys_data.out 2>&1
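For completeness, the shape of the driver I have in mind around these commands is roughly the following, so that the short sampling windows (the sleep 4 / sleep 5 above) land well inside the steady state. Again a sketch, not test_c2c.cc itself; the thread count, durations and counter are placeholders:

  #include <atomic>
  #include <chrono>
  #include <thread>
  #include <vector>

  // Placeholder workload: start the workers, let them run through a warm-up
  // period, then keep a steady state long enough for the sampling to happen.
  std::atomic<bool> stop{false};
  std::atomic<int> counter{0};  // stand-in for the mempool counters under test

  void hammer() {
    while (!stop.load(std::memory_order_relaxed))
      counter.fetch_add(1, std::memory_order_relaxed);
  }

  int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)
      workers.emplace_back(hammer);

    // Warm-up: give the caches, the allocator and the scheduler time to settle.
    std::this_thread::sleep_for(std::chrono::seconds(30));

    // Steady state: the perf c2c record commands are run from outside the
    // process somewhere in this window.
    std::this_thread::sleep_for(std::chrono::minutes(5));

    stop.store(true);
    for (auto& w : workers)
      w.join();
  }

The 30-second warm-up mentioned at the top of this email corresponds to the first sleep; the sampling commands were run during the second one.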
-- 
Loïc Dachary, Artisan Logiciel Libre