RE: Showing my ignorance - kernel workers

Hi,
I apologize for resurrecting an old thread. I'm trying to gain control of the md<arraynumber>_raid6 system processes. I have two RAID6 software RAID volumes on an 8-NUMA-node system (AMD Milan, dual socket, NUMA nodes per socket set to 4), and I'm trying not to pay the relative latency penalty of crossing NUMA nodes:
Relative latency matrix (name NUMALatency kind 5) between 8 NUMANodes (depth -3) by logical indexes:
  index     0     1     2     3     4     5     6     7
      0    10    12    12    12    32    32    32    32
      1    12    10    12    12    32    32    32    32
      2    12    12    10    12    32    32    32    32
      3    12    12    12    10    32    32    32    32
      4    32    32    32    32    10    12    12    12
      5    32    32    32    32    12    10    12    12
      6    32    32    32    32    12    12    10    12
      7    32    32    32    32    12    12    12    10

From walking an strace of mdadm, walking udev, walking systemd, and reading everything I can find about mdraid and how to control it, the best I can figure out is that
ioctl(FD, RUN_ARRAY) must be what starts the parity engine process.

I would like NUMA control of this parity engine process, if possible.
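
To make that concrete, this is the kind of post-assembly inspection I mean (a rough sketch, assuming the kthread is always named md<N>_raid6 and is visible to pgrep):

# locate the parity-engine kthread for md0 and show where it may run
pid=$(pgrep -x md0_raid6)
taskset -cp "$pid"                                    # current CPU affinity
egrep 'Cpus_allowed_list|Mems_allowed_list' /proc/"$pid"/status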

[root@rebel00 md]# cat /proc/mdstat   (ignore the resync below - I'm mid-experiment to see whether I can recover an array I got evil with using blkdiscard)
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid6 nvme12n1p1[0] nvme23n1p1[11] nvme22n1p1[10] nvme21n1p1[9] nvme20n1p1[8] nvme19n1p1[7] nvme18n1p1[6] nvme17n1p1[5] nvme16n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme13n1p1[1]
      37506037760 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      bitmap: 0/28 pages [0KB], 65536KB chunk

md0 : active raid6 nvme0n1p1[0] nvme11n1p1[11] nvme10n1p1[10] nvme9n1p1[9] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme4n1p1[4] nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1]
      150027868160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [==>..................]  resync = 13.4% (2017478592/15002786816) finish=193.0min speed=1120965K/sec
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>

These are the NUMA mappings for my NVMe drives (I can't change the configuration; it is what it is, given the Milan dual-socket architecture of the Dell R7525):
[root@rebel00 md]# map_numa.sh
device: nvme0 numanode: 3
device: nvme1 numanode: 3
device: nvme2 numanode: 3
device: nvme3 numanode: 3
device: nvme4 numanode: 2
device: nvme5 numanode: 2
device: nvme6 numanode: 2
device: nvme7 numanode: 2
device: nvme8 numanode: 2
device: nvme9 numanode: 2
device: nvme10 numanode: 2
device: nvme11 numanode: 2
device: nvme12 numanode: 5
device: nvme13 numanode: 5
device: nvme14 numanode: 5
device: nvme15 numanode: 5
device: nvme16 numanode: 5
device: nvme17 numanode: 5
device: nvme18 numanode: 5
device: nvme19 numanode: 5
device: nvme20 numanode: 4
device: nvme21 numanode: 4
device: nvme22 numanode: 4
device: nvme23 numanode: 4
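
(A loop like the one below produces the same kind of mapping as map_numa.sh above - a sketch, not the actual script; it assumes the controller's PCI numa_node attribute in sysfs is the one we care about:)

# report the NUMA node of each NVMe controller's PCI parent device
for c in /sys/class/nvme/nvme*; do
    echo "device: $(basename "$c") numanode: $(cat "$c"/device/numa_node)"
done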

[root@rebel00 md]# !! | grep -v kworker
ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx | grep -v kworker
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
    806     806 TS       -   5  14    1  24  0.0 SN   0xffff ksmd
    867     867 TS       - -20  39    1  30  0.0 I<   -      md
   2439    2439 TS       - -20  39    3  48  0.0 I<   -      raid5wq
   3444    3444 TS       -   0  19    3  53  3.0 R    -      md0_raid6
   3805    3805 TS       -   0  19    7 113  0.0 Ss   -      lsmd
1400292 1400292 TS       -   0  19    2  32 99.2 R    -      md0_resync
1403811 1403811 TS       -   0  19    7 126  0.0 S    -      md1_raid6
1405670 1405670 TS       -   0  19    4  66  0.0 Ss   -      amd.py


I would like to pin the md0_raid6 process to NUMA node 2 and the md1_raid6 process to NUMA node 5. I know I can do it after the process starts with taskset, but I can't be sure how to get the memory that process allocated before the taskset moved to the appropriate NUMA node. As you can see, I got lucky and md0_raid6 is on NUMA node 3 (almost the 2 I wanted), but md1_raid6 is on NUMA node 7, where I'd prefer it to be on node 5 (or 4 in the worst case).
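
For what it's worth, the post-start workaround I'm describing would look something like this (an untested sketch; migratepages comes from the numactl package, and I'm not at all sure it reaches kernel-side allocations such as the stripe cache):

# pin each raid6 kthread to the CPUs of the node its drives hang off
taskset -cp $(cat /sys/devices/system/node/node2/cpulist) $(pgrep -x md0_raid6)
taskset -cp $(cat /sys/devices/system/node/node5/cpulist) $(pgrep -x md1_raid6)

# best-effort attempt to move pages already allocated on the wrong nodes
migratepages $(pgrep -x md0_raid6) 3 2    # md0_raid6 currently sits on node 3
migratepages $(pgrep -x md1_raid6) 7 5    # md1_raid6 currently sits on node 7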

I'm open to suggestions. The NUMA layout of these newer servers keeps getting more complicated: there ends up being roughly one PCIe x16 root per NUMA domain, but NUMA node 2 on socket 0 steals an xGMI2 link to pick up 16 more lanes (32 lanes total, hence 8 of the 12 drives in that RAID6), and similarly NUMA node 5 on socket 1 ends up with 32 lanes (8 of the 12 drives in its RAID6).

Any help is appreciated. Longer term, I'd be thrilled if there were an mdadm.conf parameter or a /sys/module/md_mod/parameters entry to take control of this.

Without understanding how ioctl(FD, RUN_ARRAY) really works, I haven't pursued whether I could "hack something together" in /usr/lib/udev/rules.d/64-md-raid-assembly.rules by wrapping the mdadm calls with numactl invocations of my choosing.
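
The sort of hack I have in mind would be roughly the following (completely untested; I don't know whether a memory policy set on mdadm even propagates to a kthread created from RUN_ARRAY, and it assumes the rule spells out the full /usr/sbin/mdadm path):

# shadow the packaged rule with a copy that runs mdadm under numactl
# (node 2 here is just an example; a real version would need per-array logic)
cp /usr/lib/udev/rules.d/64-md-raid-assembly.rules /etc/udev/rules.d/
sed -i 's|/usr/sbin/mdadm|/usr/bin/numactl --cpunodebind=2 --membind=2 /usr/sbin/mdadm|g' \
    /etc/udev/rules.d/64-md-raid-assembly.rules
udevadm control --reload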

Regards,
Jim Finlayson
US Department of Defense

-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> 
Sent: Wednesday, January 26, 2022 3:18 PM
To: linux-raid@xxxxxxxxxxxxxxx
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: Showing my ignorance - kernel workers

All,
I apologize in advance if this is covered somewhere you can just point me to - I haven't found much to read about mdraid besides the source code, and I'm beyond the bounds of my understanding of Linux. Background: I do a lot of NUMA-aware computing. I have two systems configured identically, each with a RAID5 LUN focused on NUMA node 0 (built from node 0 NVMe drives) and an identically configured RAID5 LUN focused on NUMA node 1: 9+1 NVMe, 128KB stripe, XFS on top, 64KB O_DIRECT reads from the application.

On one system, the kernel worker for each of the two MDs lands on the NUMA node where its drives are located, yet on the second system both sit on NUMA node 0. I'm speculating that I could get more consistent performance out of the identical LUNs if I could tie each kernel worker to the proper NUMA domain. Is my speculation accurate? If so, how might I go about this, or is this a feature request?

Both systems are running the same kernel on top of RHEL8.
uname -r
5.15.13-1.el8.elrepo.x86_64

System 1:

# ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
   1559    1559 TS       -   5  14    1 244  0.0 SN   -      ksmd
   1627    1627 TS       - -20  39    1 196  0.0 I<   -      md
   3734    3734 TS       - -20  39    1 110  0.0 I<   -      raid5wq
   3752    3752 TS       -   0  19    0  22 10.5 S    -      md0_raid5
   3753    3753 TS       -   0  19    1 208 11.4 S    -      md1_raid5
   3838    3838 TS       -   0  19    0  57  0.0 Ss   -      lsmd

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.65    0.00    5.43    0.28    0.00   93.63

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
md0           1263604.00    0.00 62411724.00      0.00     0.00     0.00   0.00   0.00    1.94    0.00 2451.89    49.39     0.00   0.00 100.00
md1           1116529.00    0.00 55157228.00      0.00     0.00     0.00   0.00   0.00    2.45    0.00 2733.76    49.40     0.00   0.00 100.

System 2:

ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm  | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
   1492    1492 TS       -   5  14    1 200  0.0 SN   -      ksmd
   1560    1560 TS       - -20  39    1 200  0.0 I<   -      md
   3810    3810 TS       - -20  39    0 137  0.0 I<   -      raid5wq
   3811    3811 TS       -   0  19    0 148  0.0 S    -      md0_raid5
   3824    3824 TS       -   0  19    0 167  0.0 S    -      md1_raid5
   3929    3929 TS       -   0  19    1 115  0.0 Ss   -      lsmd

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.58    0.00    5.61    0.29    0.00   93.51

Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util
md0           1118252.00    0.00 55171048.00      0.00     0.00     0.00   0.00   0.00    1.79    0.00 2002.27    49.34     0.00   0.00 100.00
md1           1262715.00    0.00 62342424.00      0.00     0.00     0.00   0.00   0.00    0.61    0.00 769.19    49.37     0.00   0.00 100.00


Jim Finlayson
U.S. Department of Defense





