MDRAID NVMe performance question, but I don't know what I don't know

Hi,
Sorry, this is a long read. If you want the gist of it, look for "<KEY>" for key points. I'm having trouble finding information to troubleshoot mdraid performance issues. The latest rathole I'm going down: I have two identically configured mdraid arrays, one per NUMA node, on a dual-socket AMD Rome system with "NUMA nodes per socket" set to 1 in the BIOS. Things are cranking along at a 64K block size, but there is a substantial disparity between NUMA 0's mdraid and NUMA 1's.

[root@hornet04 block]# uname -r
<KEY> 5.15.13-1.el8.elrepo.x86_64

<KEY>  [root@hornet04 block]# cat /proc/mdstat  (md127 is NUMA 0, md126 is NUMA 1).
Personalities : [raid6] [raid5] [raid4] 
md126 : active raid5 nvme22n1p1[10] nvme20n1p1[7] nvme21n1p1[8] nvme18n1p1[5] nvme19n1p1[6] nvme17n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme12n1p1[0] nvme13n1p1[1]
      135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

md127 : active raid5 nvme9n1p1[10] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme3n1p1[3] nvme4n1p1[4] nvme2n1p1[2] nvme1n1p1[1] nvme0n1p1[0]
      135007147008 blocks super 1.2 level 5, 128k chunk, algorithm 2 [10/10] [UUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>


I'm running identical NUMA-aware fio jobs, but iostat shows the following (the NUMA 0 mdraid outperforms the NUMA 1 mdraid by roughly 12 GB/s):

[root@hornet04 ~]#  iostat -xkz 1 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.20    0.00    3.35    0.00    0.00   96.45

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
nvme2c2n1     72856.00    0.00 4662784.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.50    64.00     0.00   0.01 100.00
nvme3c3n1     73077.00    0.00 4676928.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.94    64.00     0.00   0.01 100.00
nvme4c4n1     73013.00    0.00 4672896.00      0.00     0.00     0.00   0.00   0.00    0.69    0.00  50.35    64.00     0.00   0.01 100.00
<KEY> nvme18c18n1   54384.00    0.00 3480576.00      0.00     0.00     0.00   0.00   0.00  144.80    0.00 7874.85    64.00     0.00   0.02 100.00
nvme5c5n1     72841.00    0.00 4661824.00      0.00     0.00     0.00   0.00   0.00    0.70    0.00  51.01    64.00     0.00   0.01 100.00
nvme7c7n1     72220.00    0.00 4622080.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.61    64.00     0.00   0.01 100.00
nvme22c22n1   54652.00    0.00 3497728.00      0.00     0.00     0.00   0.00   0.00    0.64    0.00  34.73    64.00     0.00   0.02 100.00
nvme12c12n1   54756.00    0.00 3504384.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.34    64.00     0.00   0.02 100.00
nvme14c14n1   54517.00    0.00 3489088.00      0.00     0.00     0.00   0.00   0.00    0.65    0.00  35.66    64.00     0.00   0.02 100.00
nvme6c6n1     72721.00    0.00 4654144.00      0.00     0.00     0.00   0.00   0.00    0.68    0.00  49.77    64.00     0.00   0.01 100.00
nvme21c21n1   54731.00    0.00 3502784.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.46    64.00     0.00   0.02 100.00
nvme9c9n1     72661.00    0.00 4650304.00      0.00     0.00     0.00   0.00   0.00    0.71    0.00  51.35    64.00     0.00   0.01 100.00
nvme17c17n1   54462.00    0.00 3485568.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.09    64.00     0.00   0.02 100.00
nvme20c20n1   54463.00    0.00 3485632.00      0.00     0.00     0.00   0.00   0.00    0.66    0.00  36.10    64.00     0.00   0.02 100.10
nvme13c13n1   54910.00    0.00 3514240.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.45    64.00     0.00   0.02 100.00
nvme8c8n1     72622.00    0.00 4647808.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  48.52    64.00     0.00   0.01 100.00
nvme15c15n1   54543.00    0.00 3490752.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00  33.28    64.00     0.00   0.02 100.00
nvme0c0n1     73215.00    0.00 4685760.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  49.41    64.00     0.00   0.01 100.00
nvme19c19n1   55034.00    0.00 3522176.00      0.00     0.00     0.00   0.00   0.00    0.67    0.00  36.93    64.00     0.00   0.02 100.10
<KEY> nvme1c1n1     72672.00    0.00 4650944.00      0.00     0.00     0.00   0.00   0.00  106.98    0.00 7774.54    64.00     0.00   0.01 100.00
<KEY> md127         727871.00    0.00 46583744.00      0.00     0.00     0.00   0.00   0.00   11.30    0.00 8221.92    64.00     0.00   0.00 100.00
<KEY> md126         546553.00    0.00 34979392.00      0.00     0.00     0.00   0.00   0.00   14.99    0.00 8194.91    64.00     0.00   0.00 100.10
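For reference, the fio jobs have roughly this shape, one per node, pinned with numactl. This is a sketch with illustrative values, not my exact job files:

```ini
; Sketch of the per-node job (illustrative options, not the actual job file).
; Run as: numactl --cpunodebind=0 --membind=0 fio node0.fio
; and the same against /dev/md126 with --cpunodebind=1 --membind=1.
[global]
ioengine=libaio
direct=1
rw=read
bs=64k
iodepth=32
numjobs=16
runtime=60
group_reporting=1

[node0-md127]
filename=/dev/md127
```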


<KEY> I started chasing the aqu-sz and r_await numbers to see whether I have a device issue or whether these are known mdraid "features," and then tried to track down the kernel workers, at which point it became apparent to me that I DON'T KNOW WHAT I'M DOING OR WHAT TO DO NEXT. Any guidance is appreciated. Given that only one drive per NUMA node is showing the bad behavior, I'm reluctant to point the finger at hardware.
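In case it's useful, this is roughly how I've been checking which NUMA node each NVMe controller sits on. A read-only sketch using standard sysfs paths, nothing array-specific:

```shell
#!/bin/sh
# Sketch: map each NVMe controller to its NUMA node via sysfs (read-only).
found=0
for c in /sys/class/nvme/nvme*; do
    # Skip if the glob didn't match or the attribute is missing.
    [ -e "$c/numa_node" ] || continue
    printf '%s numa_node=%s\n' "${c##*/}" "$(cat "$c/numa_node")"
    found=$((found + 1))
done
echo "controllers checked: $found"
```

On this box I'd expect controllers 0-9 to report node 0 and 12-22 to report node 1; anything else would be worth a closer look.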


[root@hornet04 ~]# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  2
Core(s) per socket:  64
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
BIOS Vendor ID:      Advanced Micro Devices, Inc.
CPU family:          23
Model:               49
Model name:          AMD EPYC 7742 64-Core Processor
BIOS Model name:     AMD EPYC 7742 64-Core Processor                
Stepping:            0
CPU MHz:             3243.803
BogoMIPS:            4491.53
Virtualization:      AMD-V
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            16384K
<KEY> NUMA node0 CPU(s):   0-63,128-191
<KEY>  NUMA node1 CPU(s):   64-127,192-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca


<KEY> When I start doing some basic debugging (not a Linux ninja by far), I see the following. What's throwing me is that the workers I suspect have to do with md are all running on NUMA node 1. This is catching me by surprise. Are there other workers that I'm missing?

ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan:14,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN          COMMAND
   1522    1522 TS       -   5  14    1 208  0.0 SN   -              ksmd
   1590    1590 TS       - -20  39    1 220  0.0 I<   -              md
   3688    3688 TS       - -20  39    1 198  0.0 I<   -              raid5wq
   3693    3693 TS       -   0  19    1 234  0.0 S    -              md126_raid5
   3694    3694 TS       -   0  19    1  95  0.0 S    -              md127_raid5
   3788    3788 TS       -   0  19    1 240  0.0 Ss   -              lsmdcat /
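Since the threads above all show NUMA node 1, one experiment I'm considering is pinning each array's raid5 thread to the node its member drives sit on. A sketch, with pids looked up by name and CPU lists taken from the lscpu output above:

```shell
#!/bin/sh
# Experiment sketch: pin each md raid5 thread to its array's NUMA node.
# Thread names are from the ps listing; CPU ranges are from lscpu above.
pin() {
    # pin <thread-name> <cpulist>; quietly no-op if the thread isn't running
    pid=$(pgrep -x "$1") || return 0
    taskset -pc "$2" "$pid"
}
pin md127_raid5 0-63,128-191     # md127's member drives are on NUMA node 0
pin md126_raid5 64-127,192-255   # md126's member drives are on NUMA node 1
```

I don't know yet whether the single raid5 thread or the raid5wq workqueue is the right thing to pin, so this is a shot in the dark rather than a known fix.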



Jim Finlayson
U.S. Department of Defense




