Hi, I apologize for resurrecting an old thread. I'm trying to gain control of the system processes md<arraynumber>_raid6, as I have two RAID6 software RAID volumes on an 8 NUMA node system (AMD Milan dual socket, NUMAs per socket set to 4), and I'm trying not to pay the relative latency penalty of crossing NUMA nodes:

Relative latency matrix (name NUMALatency kind 5) between 8 NUMANodes (depth -3) by logical indexes:
  index     0     1     2     3     4     5     6     7
      0    10    12    12    12    32    32    32    32
      1    12    10    12    12    32    32    32    32
      2    12    12    10    12    32    32    32    32
      3    12    12    12    10    32    32    32    32
      4    32    32    32    32    10    12    12    12
      5    32    32    32    32    12    10    12    12
      6    32    32    32    32    12    12    10    12
      7    32    32    32    32    12    12    12    10

In all of my walking of an strace of mdadm, walking udev, walking systemd, and reading everything I can find on mdraid and controlling it, the best I can figure out is that ioctl(FD, RUN_ARRAY) must start the parity engine process. I would like NUMA control of this parity engine process, if possible.

[root@rebel00 md]# cat /proc/mdstat
(ignore the resync below - I'm mid-experiment, seeing if I can recover an array I got evil with using blkdiscard)
Personalities : [raid6] [raid5] [raid4]
md1 : active raid6 nvme12n1p1[0] nvme23n1p1[11] nvme22n1p1[10] nvme21n1p1[9] nvme20n1p1[8] nvme19n1p1[7] nvme18n1p1[6] nvme17n1p1[5] nvme16n1p1[4] nvme15n1p1[3] nvme14n1p1[2] nvme13n1p1[1]
      37506037760 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      bitmap: 0/28 pages [0KB], 65536KB chunk

md0 : active raid6 nvme0n1p1[0] nvme11n1p1[11] nvme10n1p1[10] nvme9n1p1[9] nvme8n1p1[8] nvme7n1p1[7] nvme6n1p1[6] nvme5n1p1[5] nvme4n1p1[4] nvme3n1p1[3] nvme2n1p1[2] nvme1n1p1[1]
      150027868160 blocks super 1.2 level 6, 512k chunk, algorithm 2 [12/12] [UUUUUUUUUUUU]
      [==>..................]  resync = 13.4% (2017478592/15002786816) finish=193.0min speed=1120965K/sec
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>

These are the NUMA mappings for my nvme drives (I can't change the configuration; they are what they are, based on the Milan dual-socket architecture in a Dell R7525):

[root@rebel00 md]# map_numa.sh
device: nvme0 numanode: 3
device: nvme1 numanode: 3
device: nvme2 numanode: 3
device: nvme3 numanode: 3
device: nvme4 numanode: 2
device: nvme5 numanode: 2
device: nvme6 numanode: 2
device: nvme7 numanode: 2
device: nvme8 numanode: 2
device: nvme9 numanode: 2
device: nvme10 numanode: 2
device: nvme11 numanode: 2
device: nvme12 numanode: 5
device: nvme13 numanode: 5
device: nvme14 numanode: 5
device: nvme15 numanode: 5
device: nvme16 numanode: 5
device: nvme17 numanode: 5
device: nvme18 numanode: 5
device: nvme19 numanode: 5
device: nvme20 numanode: 4
device: nvme21 numanode: 4
device: nvme22 numanode: 4
device: nvme23 numanode: 4
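For reference, map_numa.sh is nothing fancy, just a sysfs walk. A minimal sketch of the idea (my actual script may differ slightly; this only assumes the stock nvme sysfs layout):

#!/bin/bash
# report the NUMA node that owns each NVMe controller
for dev in /sys/class/nvme/nvme*; do
    # the "device" symlink points at the PCI function, which exposes numa_node
    echo "device: $(basename "$dev") numanode: $(cat "$dev/device/numa_node")"
done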
[root@rebel00 md]# !! | grep -v kworker
ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | egrep 'md|raid' | grep -v systemd | grep -v mlx | grep -v kworker
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
    806     806 TS       -   5  14    1  24  0.0 SN   0xffff ksmd
    867     867 TS       - -20  39    1  30  0.0 I<   -      md
   2439    2439 TS       - -20  39    3  48  0.0 I<   -      raid5wq
   3444    3444 TS       -   0  19    3  53  3.0 R    -      md0_raid6
   3805    3805 TS       -   0  19    7 113  0.0 Ss   -      lsmd
1400292 1400292 TS       -   0  19    2  32 99.2 R    -      md0_resync
1403811 1403811 TS       -   0  19    7 126  0.0 S    -      md1_raid6
1405670 1405670 TS       -   0  19    4  66  0.0 Ss   -      amd.py

I would like to pin the md0_raid6 process to NUMA2 and the md1_raid6 process to NUMA5. I know I can do it after the process has started with taskset, but I can't be sure how to get the memory that process allocated (pre-taskset) moved to the appropriate NUMA node. As you can see, I got lucky and md0_raid6 landed on NUMA3 (almost the 2 I wanted), but md1_raid6 is on NUMA7, where I'd prefer it to be on NUMA5 (or 4, worst case).
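To be concrete about the taskset part, the post-start pinning I already know how to do looks roughly like the sketch below. The pin_md_thread helper is mine, just for illustration, and the node-to-cpulist lookup through /sys/devices/system/node is the only other assumption; it only sets CPU affinity and does nothing about memory the thread has already allocated, which is exactly the part I can't solve.

#!/bin/bash
# Pin an md parity thread to the CPUs of one NUMA node (affinity only;
# any memory the kernel thread already allocated stays where it is).
pin_md_thread() {
    local name="$1" node="$2"
    local pid cpus
    pid=$(pgrep -x "$name")                                    # e.g. md0_raid6
    cpus=$(cat "/sys/devices/system/node/node${node}/cpulist") # e.g. 32-47,160-175
    taskset -pc "$cpus" "$pid"
}

pin_md_thread md0_raid6 2   # 8 of md0's 12 drives hang off NUMA node 2
pin_md_thread md1_raid6 5   # 8 of md1's 12 drives hang off NUMA node 5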
I'm open to suggestions, but the NUMAness of these advanced nodes keeps getting more complicated: there ends up being a PCIe x16 per NUMA domain, with NUMA2 on socket 0 stealing an xGMI2 link to get 16 more lanes (32 lanes total on NUMA2, thus 8 of the 12 drives in that RAID6), and similarly on socket 1, NUMA5 ends up with 32 lanes (8 of the 12 drives in its RAID6).

Any help is appreciated. Longer term, I'd be thrilled if there were an mdadm.conf parameter or a /sys/module/md_mod/parameters entry to take control of this.

Without understanding how ioctl(FD, RUN_ARRAY) really works, I haven't pursued whether I could "hack something together" in /usr/lib/udev/rules.d/64-md-raid-assembly.rules by wrappering the mdadm calls with numactl invocations of my choosing.

Regards,
Jim Finlayson
US Department of Defense

-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Sent: Wednesday, January 26, 2022 3:18 PM
To: linux-raid@xxxxxxxxxxxxxxx
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: Showing my ignorance - kernel workers

All,
I apologize in advance; if you can point me to something I can read about mdraid besides the source code, please do. I'm beyond the bounds of my understanding of Linux.

Background: I do a bunch of NUMA-aware computing. I have two systems configured identically, with a NUMA node 0 focused RAID5 LUN containing NUMA node 0 nvme drives and a NUMA node 1 focused RAID5 LUN configured identically: 9+1 nvme, 128KB stripe, xfs sitting on top, 64KB O_DIRECT reads from the application. On one system, the kernel worker for each of the two MDs matches the NUMA node where its drives are located, yet on the second system, they both sit on NUMA node 0. I'm speculating that I could get more consistent performance out of the identical LUNs if I could tie each kernel worker to the proper NUMA domain. Is my speculation accurate? If so, how might I go about this, or is this a feature request? Both systems are running the same kernel on top of RHEL8.

uname -r
5.15.13-1.el8.elrepo.x86_64

System 1:
# ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
   1559    1559 TS       -   5  14    1 244  0.0 SN   -      ksmd
   1627    1627 TS       - -20  39    1 196  0.0 I<   -      md
   3734    3734 TS       - -20  39    1 110  0.0 I<   -      raid5wq
   3752    3752 TS       -   0  19    0  22 10.5 S    -      md0_raid5
   3753    3753 TS       -   0  19    1 208 11.4 S    -      md1_raid5
   3838    3838 TS       -   0  19    0  57  0.0 Ss   -      lsmd

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.65    0.00    5.43    0.28    0.00   93.63

Device            r/s     w/s       rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await  aqu-sz rareq-sz wareq-sz  svctm  %util
md0        1263604.00    0.00 62411724.00    0.00    0.00    0.00   0.00   0.00    1.94    0.00 2451.89    49.39     0.00   0.00 100.00
md1        1116529.00    0.00 55157228.00    0.00    0.00    0.00   0.00   0.00    2.45    0.00 2733.76    49.40     0.00   0.00 100.

System 2:
ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | head -1; ps -eo pid,tid,class,rtprio,ni,pri,numa,psr,pcpu,stat,wchan,comm | egrep 'md|raid' | grep -v systemd | grep -v mlx
    PID     TID CLS RTPRIO  NI PRI NUMA PSR %CPU STAT WCHAN  COMMAND
   1492    1492 TS       -   5  14    1 200  0.0 SN   -      ksmd
   1560    1560 TS       - -20  39    1 200  0.0 I<   -      md
   3810    3810 TS       - -20  39    0 137  0.0 I<   -      raid5wq
   3811    3811 TS       -   0  19    0 148  0.0 S    -      md0_raid5
   3824    3824 TS       -   0  19    0 167  0.0 S    -      md1_raid5
   3929    3929 TS       -   0  19    1 115  0.0 Ss   -      lsmd

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.58    0.00    5.61    0.29    0.00   93.51

Device            r/s     w/s       rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm r_await w_await  aqu-sz rareq-sz wareq-sz  svctm  %util
md0        1118252.00    0.00 55171048.00    0.00    0.00    0.00   0.00   0.00    1.79    0.00 2002.27    49.34     0.00   0.00 100.00
md1        1262715.00    0.00 62342424.00    0.00    0.00    0.00   0.00   0.00    0.61    0.00  769.19    49.37     0.00   0.00 100.00

Jim Finlayson
U.S. Department of Defense