Sorry, this will be a long email with everything I find relevant. I can get over 110GB/s of 4KB random reads out of the individual NVMe SSDs, but I'm at a loss as to why mdraid delivers only a small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. Below is everything I do to a server to make the I/O crank. My role is that of a lab researcher/resident expert/consultant, and I'm just stumped as to why I can't do better. If there is a fine manual somebody can point me to, I'm happy to read it.

I have tried both RAID5 and RAID6, trying to be highly cognizant of NUMA-ness. The Rome BIOS is set to one NUMA node per socket (NPS1) and tuned to maximize Infinity Fabric and PCIe performance per AMD's white papers. The NVMe drives are all Gen4 (I believe HPE-rebadged Samsung 1733a?), and I can get each drive doing 1.45M 4KB random reads if I try hard. Everything I can think to share:

[root@<server> <server>]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.4 (Ootpa)

[root@<server> <server>]# uname -r
4.18.0-305.el8.x86_64

[root@<server> ~]# modinfo raid6
filename:       /lib/modules/4.18.0-305.el8.x86_64/kernel/drivers/md/raid456.ko.xz
alias:          raid6
alias:          raid5
alias:          md-level-6
alias:          md-raid6
alias:          md-personality-8
alias:          md-level-4
alias:          md-level-5
alias:          md-raid4
alias:          md-raid5
alias:          md-personality-4
description:    RAID4/5/6 (striping with parity) personality for MD
license:        GPL
rhelversion:    8.4
srcversion:     FE86A53E1C1CDAE8F972CBA
depends:        async_raid6_recov,async_pq,libcrc32c,raid6_pq,async_tx,async_memcpy,async_xor
intree:         Y
name:           raid456
vermagic:       4.18.0-305.el8.x86_64 SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Red Hat Enterprise Linux kernel signing key

[root@<server> ~]# lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme16n1      259:0    0   1.8T  0 disk
├─nvme16n1p1  259:1    0   512M  0 part  /boot/efi
├─nvme16n1p2  259:2    0   512M  0 part  /boot
├─nvme16n1p3  259:3    0  49.4G  0 part  [SWAP]
└─nvme16n1p4  259:4    0   1.7T  0 part  /
nvme0n1       259:5    0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme1n1       259:6    0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme2n1       259:7    0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme3n1       259:8    0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme7n1       259:9    0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme11n1      259:10   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme10n1      259:11   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme14n1      259:12   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme5n1       259:13   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme8n1       259:14   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme6n1       259:15   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme9n1       259:16   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme15n1      259:17   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme20n1      259:18   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme13n1      259:19   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme18n1      259:20   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme4n1       259:21   0    14T  0 disk
└─md0           9:0    0 139.7T  0 raid5
nvme21n1      259:22   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme22n1      259:23   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme24n1      259:24   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme12n1      259:25   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme17n1      259:26   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme19n1      259:27   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5
nvme23n1      259:28   0    14T  0 disk
└─md1           9:1    0 139.7T  0 raid5

[root@<server> ~]# lsblk -o KNAME,MODEL,VENDOR
KNAME     MODEL              VENDOR
nvme0n1   MZXL515THALA-000H3
nvme1n1   MZXL515THALA-000H3
nvme2n1   MZXL515THALA-000H3
nvme3n1   MZXL515THALA-000H3
nvme7n1   MZXL515THALA-000H3
nvme11n1  MZXL515THALA-000H3
nvme10n1  MZXL515THALA-000H3
nvme14n1  MZXL515THALA-000H3
nvme5n1   MZXL515THALA-000H3
nvme8n1   MZXL515THALA-000H3
nvme6n1   MZXL515THALA-000H3
nvme9n1   MZXL515THALA-000H3
nvme15n1  MZXL515THALA-000H3
nvme20n1  MZXL515THALA-000H3
nvme13n1  MZXL515THALA-000H3
nvme18n1  MZXL515THALA-000H3
nvme4n1   MZXL515THALA-000H3
nvme21n1  MZXL515THALA-000H3
nvme22n1  MZXL515THALA-000H3
nvme24n1  MZXL515THALA-000H3
nvme12n1  MZXL515THALA-000H3
nvme17n1  MZXL515THALA-000H3
nvme19n1  MZXL515THALA-000H3
nvme23n1  MZXL515THALA-000H3

[root@<server> jim]# ./map_numa.sh   (nvme16 is the boot drive; nvme0-11 are on NUMA node 0, nvme12-24 on NUMA node 1)
device: nvme8 numanode: 0
device: nvme9 numanode: 0
device: nvme10 numanode: 0
device: nvme11 numanode: 0
device: nvme4 numanode: 0
device: nvme5 numanode: 0
device: nvme6 numanode: 0
device: nvme7 numanode: 0
device: nvme2 numanode: 0
device: nvme3 numanode: 0
device: nvme0 numanode: 0
device: nvme1 numanode: 0
device: nvme21 numanode: 1
device: nvme22 numanode: 1
device: nvme23 numanode: 1
device: nvme24 numanode: 1
device: nvme16 numanode: 1
device: nvme17 numanode: 1
device: nvme18 numanode: 1
device: nvme19 numanode: 1
device: nvme20 numanode: 1
device: nvme14 numanode: 1
device: nvme15 numanode: 1
device: nvme12 numanode: 1
device: nvme13 numanode: 1

[root@<server> jim]# cat /etc/udev/rules.d/99-abj.nr_32.rules
KERNEL=="nvme*[0-9]n*[0-9]",ATTRS{model}=="MZXL515THALA-000H3",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096",PROGRAM="/usr/sbin/nvme set-feature /dev/%k --feature-id 8 --value 522"   {coalesce up to 10 interrupts per device}
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", ATTR{md/sync_speed_max}="2000000",ATTR{md/group_thread_cnt}="64",ATTR{md/stripe_cache_size}="8192",ATTR{queue/io_poll}="1",ATTR{queue/io_poll_delay}="100000",ATTR{queue/nomerges}="2",ATTR{queue/nr_requests}="1023",ATTR{queue/rotational}="0",ATTR{queue/rq_affinity}="2",ATTR{queue/scheduler}="none",ATTR{queue/add_random}="0",ATTR{queue/max_sectors_kb}="4096"

(I know the nr_requests=1023 doesn't take effect on the md devices; it's there for reference. We tune for maximum IOPS rather than latency, hence going hard at rq_affinity, nomerges, and so on.)

[root@<server> <server>]# cat /proc/mdstat   (the 128K chunk is just something Fusion-io told me way back when, and I've never needed to change it)
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 nvme11n1[11](S) nvme10n1[10] nvme9n1[9] nvme8n1[8] nvme7n1[7] nvme6n1[6] nvme5n1[5] nvme4n1[4] nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]
      150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

md1 : active raid5 nvme24n1[11](S) nvme23n1[10] nvme22n1[9] nvme21n1[8] nvme20n1[7] nvme19n1[6] nvme18n1[5] nvme17n1[4] nvme15n1[3] nvme14n1[2] nvme13n1[1] nvme12n1[0]
      150007961600 blocks super 1.2 level 5, 128k chunk, algorithm 2 [11/11] [UUUUUUUUUUU]
      bitmap: 0/112 pages [0KB], 65536KB chunk

unused devices: <none>

[root@<server> /]# grep raid /var/log/messages
What troubles me is that if mdraid checked parity on every read I could somewhat understand the gap, but I would think the reads are nearly a pass-through.
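Before the log excerpt, here is the back-of-the-envelope arithmetic behind that pass-through assumption, just so my mental model is on the record. This is only an illustrative sketch using the geometry from the mdstat above (11 active members = 10 data chunks + 1 parity chunk per stripe, 128k chunk); it does the chunk/stripe arithmetic only and does not reproduce the left-symmetric (algorithm 2) parity rotation that picks the physical member:

#!/bin/bash
# Sketch only: show that a 4 KiB random read on a 128 KiB-chunk RAID5
# lands entirely inside one chunk, i.e. one member NVMe, with no parity work.
chunk=$((128 * 1024))      # chunk size in bytes
data_disks=10              # data chunks per stripe (11 active members - 1 parity)
offset=${1:-123456786432}  # example byte offset into /dev/md0 (4 KiB aligned)

chunk_nr=$((offset / chunk))                # which data chunk of the array
stripe_nr=$((chunk_nr / data_disks))        # which stripe holds that chunk
chunk_in_stripe=$((chunk_nr % data_disks))  # position of the chunk in the stripe
off_in_chunk=$((offset % chunk))

echo "offset $offset -> stripe $stripe_nr, data chunk $chunk_in_stripe, byte $off_in_chunk of the chunk"
if (( off_in_chunk + 4096 <= chunk )); then
    echo "a 4 KiB read here fits inside a single 128 KiB chunk -> exactly one member device"
fi

Since fio's bs=4k random offsets are 4 KiB aligned, a read should never straddle a chunk boundary, so in my mental model the md read path should cost little more than the raw device path. (The output of the grep above follows.)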
Jul 27 00:00:02 <server> rpmlist_verification[12745]: libblockdev-mdraid 2.24 Thu 22 Jul 2021 02:58:37 PM GMT
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 gen() 9792 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x1 xor() 6436 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 gen() 11198 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x2 xor() 9546 MB/s
Jul 27 18:00:28 <server> kernel: raid6: sse2x4 gen() 14271 MB/s
Jul 27 18:00:29 <server> kernel: raid6: sse2x4 xor() 6354 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 gen() 22838 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x1 xor() 14069 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x2 xor() 18380 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 gen() 26601 MB/s
Jul 27 18:00:29 <server> kernel: raid6: avx2x4 xor() 7025 MB/s
Jul 27 18:00:29 <server> kernel: raid6: using algorithm avx2x2 gen() 26973 MB/s
Jul 27 18:00:29 <server> kernel: raid6: .... xor() 18380 MB/s, rmw enabled
Jul 27 18:00:29 <server> kernel: raid6: using avx2x2 recovery algorithm

[root@<server> <server>]# cat fiojim.hpdl385.nps1
[global]
name=random
iodepth=128
ioengine=libaio
direct=1
norandommap
group_reporting
randrepeat=1
random_generator=tausworthe64
bs=4k
rw=randread
numjobs=64
runtime=60

[socket0]
new_group
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/nvme0n1
filename=/dev/nvme1n1
filename=/dev/nvme2n1
filename=/dev/nvme3n1
filename=/dev/nvme4n1
filename=/dev/nvme5n1
filename=/dev/nvme6n1
filename=/dev/nvme7n1
filename=/dev/nvme8n1
filename=/dev/nvme9n1
filename=/dev/nvme10n1
filename=/dev/nvme11n1

[socket1]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/nvme12n1
filename=/dev/nvme13n1
filename=/dev/nvme14n1
filename=/dev/nvme15n1
filename=/dev/nvme17n1
filename=/dev/nvme18n1
filename=/dev/nvme19n1
filename=/dev/nvme20n1
filename=/dev/nvme21n1
filename=/dev/nvme22n1
filename=/dev/nvme23n1
filename=/dev/nvme24n1

[socket0-md]
stonewall
numa_mem_policy=bind:0
numa_cpu_nodes=0
filename=/dev/md0

[socket1-md]
new_group
numa_mem_policy=bind:1
numa_cpu_nodes=1
filename=/dev/md1

iostat -xkz 1 with the raw drives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.32    0.00   38.30    0.00    0.00   53.39

Device    r/s        w/s  rkB/s      wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz  rareq-sz wareq-sz svctm %util
nvme0n1   1317510.00 0.00 5270044.00 0.00  0.00   0.00   0.00  0.00  0.31    0.00    411.95  4.00     0.00     0.00  100.40
nvme1n1   1317548.00 0.00 5270192.00 0.00  0.00   0.00   0.00  0.00  0.32    0.00    417.38  4.00     0.00     0.00  100.00
nvme2n1   1317578.00 0.00 5270316.00 0.00  0.00   0.00   0.00  0.00  0.31    0.00    414.77  4.00     0.00     0.00  100.20
nvme3n1   1317554.00 0.00 5270216.00 0.00  0.00   0.00   0.00  0.00  0.31    0.00    413.25  4.00     0.00     0.00  100.40
nvme7n1   1317559.00 0.00 5270236.00 0.00  0.00   0.00   0.00  0.00  0.33    0.00    430.03  4.00     0.00     0.00  100.40
nvme11n1  1317502.00 0.00 5269996.00 0.00  0.00   0.00   0.00  0.00  0.73    0.00    964.85  4.00     0.00     0.00  100.40
nvme10n1  1317656.00 0.00 5270624.00 0.00  0.00   0.00   0.00  0.00  0.80    0.00    1050.05 4.00     0.00     0.00  100.40
nvme14n1  1107632.00 0.00 4430528.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    307.52  4.00     0.00     0.00  100.40
nvme5n1   1317583.00 0.00 5270332.00 0.00  0.00   0.00   0.00  0.00  0.33    0.00    430.47  4.00     0.00     0.00  100.00
nvme8n1   1317617.00 0.00 5270468.00 0.00  0.00   0.00   0.00  0.00  0.74    0.00    972.52  4.00     0.00     0.00  101.00
nvme6n1   1317535.00 0.00 5270144.00 0.00  0.00   0.00   0.00  0.00  0.33    0.00    432.48  4.00     0.00     0.00  100.60
nvme9n1   1317582.00 0.00 5270328.00 0.00  0.00   0.00   0.00  0.00  0.75    0.00    992.82  4.00     0.00     0.00  100.40
nvme15n1  1107703.00 0.00 4430816.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    305.93  4.00     0.00     0.00  100.60
nvme20n1  1107712.00 0.00 4430848.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    306.72  4.00     0.00     0.00  100.20
nvme13n1  1107714.00 0.00 4430852.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    307.10  4.00     0.00     0.00  101.40
nvme18n1  1107674.00 0.00 4430696.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    306.04  4.00     0.00     0.00  100.20
nvme4n1   1317521.00 0.00 5270076.00 0.00  0.00   0.00   0.00  0.00  0.33    0.00    431.63  4.00     0.00     0.00  100.20
nvme21n1  1107714.00 0.00 4430856.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    309.11  4.00     0.00     0.00  100.40
nvme22n1  1107711.00 0.00 4430840.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    308.52  4.00     0.00     0.00  100.60
nvme24n1  1107441.00 0.00 4429768.00 0.00  0.00   0.00   0.00  0.00  3.86    0.00    4271.29 4.00     0.00     0.00  100.20
nvme12n1  1107733.00 0.00 4430932.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    307.70  4.00     0.00     0.00  100.40
nvme17n1  1107858.00 0.00 4431436.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    307.95  4.00     0.00     0.00  100.60
nvme19n1  1107766.00 0.00 4431064.00 0.00  0.00   0.00   0.00  0.00  0.28    0.00    307.17  4.00     0.00     0.00  100.40
nvme23n1  1108033.00 0.00 4432132.00 0.00  0.00   0.00   0.00  0.00  0.31    0.00    340.62  4.00     0.00     0.00  100.00

iostat -xkz 1 with the md's:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.56    0.00   49.94    0.00    0.00   49.51

Device    r/s        w/s  rkB/s      wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1   114589.00  0.00 458356.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.54  4.00     0.00     0.01  100.00
nvme1n1   115284.00  0.00 461136.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.77  4.00     0.00     0.01  100.00
nvme2n1   114911.00  0.00 459644.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.61  4.00     0.00     0.01  100.00
nvme3n1   114538.00  0.00 458152.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.55  4.00     0.00     0.01  100.00
nvme7n1   114524.00  0.00 458096.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.53  4.00     0.00     0.01  100.00
nvme10n1  114934.00  0.00 459736.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.61  4.00     0.00     0.01  100.00
nvme14n1  97399.00   0.00 389596.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.41  4.00     0.00     0.01  100.00
nvme5n1   114929.00  0.00 459716.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.61  4.00     0.00     0.01  100.00
nvme8n1   114393.00  0.00 457572.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.40  4.00     0.00     0.01  99.90
nvme6n1   114731.00  0.00 458924.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.56  4.00     0.00     0.01  99.90
nvme9n1   114146.00  0.00 456584.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.37  4.00     0.00     0.01  99.90
nvme15n1  96960.00   0.00 387840.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.30  4.00     0.00     0.01  100.00
nvme20n1  97171.00   0.00 388684.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.36  4.00     0.00     0.01  100.00
nvme13n1  96874.00   0.00 387496.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.31  4.00     0.00     0.01  100.00
nvme18n1  96696.00   0.00 386784.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.16  4.00     0.00     0.01  100.00
nvme4n1   115220.00  0.00 460876.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    33.75  4.00     0.00     0.01  100.00
nvme21n1  96756.00   0.00 387024.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.24  4.00     0.00     0.01  100.00
nvme22n1  97352.00   0.00 389408.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.36  4.00     0.00     0.01  100.00
nvme12n1  96899.00   0.00 387596.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.22  4.00     0.00     0.01  100.20
nvme17n1  96748.00   0.00 386992.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.24  4.00     0.00     0.01  100.00
nvme19n1  97191.00   0.00 388764.00  0.00  0.00   0.00   0.00  0.00  0.30    0.00    29.30  4.00     0.00     0.01  100.00
nvme23n1  96787.00   0.00 387148.00  0.00  0.00   0.00   0.00  0.00  0.29    0.00    28.41  4.00     0.00     0.01  99.90
md1       1066812.00 0.00 4267248.00 0.00  0.00   0.00   0.00  0.00  0.00    0.00    0.00   4.00     0.00     0.00  0.00
md0       1262173.00 0.00 5048692.00 0.00  0.00   0.00   0.00  0.00  0.00    0.00    0.00   4.00     0.00     0.00  0.00

fio output:

socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.6%][r=9103MiB/s][r=2330k IOPS][eta 02h:08m:00s]

socket0: (groupid=0, jobs=64): err= 0: pid=18344: Tue Jul 27 20:00:10 2021
  read: IOPS=16.0M, BW=60.8GiB/s (65.3GB/s)(3651GiB/60003msec)
    slat (nsec): min=1222, max=18033k, avg=2429.23, stdev=2975.48
    clat (usec): min=24, max=20221, avg=510.51, stdev=336.57
     lat (usec): min=30, max=20240, avg=513.01, stdev=336.58
    clat percentiles (usec):
     |  1.00th=[  147],  5.00th=[  194], 10.00th=[  229], 20.00th=[  281],
     | 30.00th=[  326], 40.00th=[  367], 50.00th=[  412], 60.00th=[  469],
     | 70.00th=[  553], 80.00th=[  676], 90.00th=[  914], 95.00th=[ 1156],
     | 99.00th=[ 1778], 99.50th=[ 2073], 99.90th=[ 2868], 99.95th=[ 3294],
     | 99.99th=[ 4424]
   bw (  MiB/s): min=52367, max=65429, per=32.81%, avg=62388.68, stdev=33.73, samples=7424
   iops        : min=13406054, max=16749890, avg=15971477.42, stdev=8635.86, samples=7424
  lat (usec)   : 50=0.01%, 100=0.02%, 250=13.89%, 500=50.33%, 750=19.72%
  lat (usec)   : 1000=8.24%
  lat (msec)   : 2=7.22%, 4=0.57%, 10=0.02%, 20=0.01%, 50=0.01%
  cpu          : usr=17.93%, sys=49.30%, ctx=21719222, majf=0, minf=9915
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=957111950,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

socket1: (groupid=1, jobs=64): err= 0: pid=18408: Tue Jul 27 20:00:10 2021
  read: IOPS=13.5M, BW=51.4GiB/s (55.2GB/s)(3085GiB/60008msec)
    slat (nsec): min=1232, max=1696.9k, avg=2580.28, stdev=2841.95
    clat (usec): min=21, max=26808, avg=604.58, stdev=1211.79
     lat (usec): min=26, max=26810, avg=607.23, stdev=1211.80
    clat percentiles (usec):
     |  1.00th=[  124],  5.00th=[  157], 10.00th=[  184], 20.00th=[  225],
     | 30.00th=[  258], 40.00th=[  289], 50.00th=[  318], 60.00th=[  351],
     | 70.00th=[  388], 80.00th=[  437], 90.00th=[  586], 95.00th=[ 2769],
     | 99.00th=[ 6587], 99.50th=[ 9372], 99.90th=[12649], 99.95th=[13829],
     | 99.99th=[16712]
   bw (  MiB/s): min=32950, max=67704, per=20.46%, avg=52713.11, stdev=106.96, samples=7424
   iops        : min=8435402, max=17332350, avg=13494532.64, stdev=27383.02, samples=7424
  lat (usec)   : 50=0.01%, 100=0.16%, 250=27.38%, 500=59.09%, 750=4.93%
  lat (usec)   : 1000=0.30%
  lat (msec)   : 2=0.60%, 4=5.67%, 10=1.47%, 20=0.39%, 50=0.01%
  cpu          : usr=14.86%, sys=45.29%, ctx=36050249, majf=0, minf=10046
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=808781317,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

socket0-md: (groupid=2, jobs=64): err= 0: pid=18479: Tue Jul 27 20:00:10 2021
  read: IOPS=1263k, BW=4934MiB/s (5174MB/s)(289GiB/60001msec)
    slat (nsec): min=1512, max=48037k, avg=49957.85, stdev=33615.19
    clat (usec): min=176, max=51614, avg=6432.56, stdev=410.54
     lat (usec): min=178, max=51639, avg=6482.58, stdev=412.23
    clat percentiles (usec):
     |  1.00th=[ 6128],  5.00th=[ 6259], 10.00th=[ 6325], 20.00th=[ 6325],
     | 30.00th=[ 6390], 40.00th=[ 6390], 50.00th=[ 6456], 60.00th=[ 6456],
     | 70.00th=[ 6521], 80.00th=[ 6521], 90.00th=[ 6587], 95.00th=[ 6587],
     | 99.00th=[ 6652], 99.50th=[ 6718], 99.90th=[ 7635], 99.95th=[16909],
     | 99.99th=[18220]
   bw (  MiB/s): min= 4582, max= 5934, per=100.00%, avg=4938.25, stdev= 2.07, samples=7616
   iops        : min=1173219, max=1519297, avg=1264175.97, stdev=528.77, samples=7616
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.34%, 10=99.57%, 20=0.08%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=1.23%, sys=95.69%, ctx=2557, majf=0, minf=9064
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=75789817,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

socket1-md: (groupid=3, jobs=64): err= 0: pid=18543: Tue Jul 27 20:00:10 2021
  read: IOPS=1071k, BW=4183MiB/s (4386MB/s)(245GiB/60002msec)
    slat (nsec): min=1563, max=14080k, avg=59051.10, stdev=22401.39
    clat (usec): min=179, max=20799, avg=7588.23, stdev=303.92
     lat (usec): min=211, max=20853, avg=7647.34, stdev=305.26
    clat percentiles (usec):
     |  1.00th=[ 7111],  5.00th=[ 7373], 10.00th=[ 7439], 20.00th=[ 7504],
     | 30.00th=[ 7504], 40.00th=[ 7570], 50.00th=[ 7570], 60.00th=[ 7635],
     | 70.00th=[ 7635], 80.00th=[ 7701], 90.00th=[ 7767], 95.00th=[ 7767],
     | 99.00th=[ 7898], 99.50th=[ 7898], 99.90th=[ 8586], 99.95th=[13304],
     | 99.99th=[19006]
   bw (  MiB/s): min= 3955, max= 4642, per=100.00%, avg=4186.20, stdev= 0.98, samples=7616
   iops        : min=1012714, max=1188416, avg=1071653.68, stdev=251.68, samples=7616
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=99.94%, 20=0.05%, 50=0.01%
  cpu          : usr=1.06%, sys=95.70%, ctx=1980, majf=0, minf=9030
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=64246431,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=60.8GiB/s (65.3GB/s), 60.8GiB/s-60.8GiB/s (65.3GB/s-65.3GB/s), io=3651GiB (3920GB), run=60003-60003msec

Run status group 1 (all jobs):
   READ: bw=51.4GiB/s (55.2GB/s), 51.4GiB/s-51.4GiB/s (55.2GB/s-55.2GB/s), io=3085GiB (3313GB), run=60008-60008msec

Run status group 2 (all jobs):
   READ: bw=4934MiB/s (5174MB/s), 4934MiB/s-4934MiB/s (5174MB/s-5174MB/s), io=289GiB (310GB), run=60001-60001msec

Run status group 3 (all jobs):
   READ: bw=4183MiB/s (4386MB/s), 4183MiB/s-4183MiB/s (4386MB/s-4386MB/s), io=245GiB (263GB), run=60002-60002msec

Disk stats (read/write):
  nvme0n1: ios=79463384/0, merge=0/0, ticks=25148472/0, in_queue=25148472, util=98.78%
  nvme1n1: ios=79463574/0, merge=0/0, ticks=25224784/0, in_queue=25224784, util=98.87%
  nvme2n1: ios=79463699/0, merge=0/0, ticks=25305193/0, in_queue=25305193, util=98.96%
  nvme3n1: ios=79463925/0, merge=0/0, ticks=25234093/0, in_queue=25234093, util=99.00%
  nvme4n1: ios=79464135/0, merge=0/0, ticks=25396547/0, in_queue=25396547, util=99.06%
  nvme5n1: ios=79464346/0, merge=0/0, ticks=25393624/0, in_queue=25393624, util=99.10%
  nvme6n1: ios=79464535/0, merge=0/0, ticks=25330700/0, in_queue=25330700, util=99.19%
  nvme7n1: ios=79464721/0, merge=0/0, ticks=25349171/0, in_queue=25349171, util=99.24%
  nvme8n1: ios=79464029/0, merge=0/0, ticks=59063115/0, in_queue=59063115, util=99.32%
  nvme9n1: ios=79464120/0, merge=0/0, ticks=59023913/0, in_queue=59023913, util=99.33%
  nvme10n1: ios=79464799/0, merge=0/0, ticks=59136926/0, in_queue=59136927, util=99.39%
  nvme11n1: ios=79465392/0, merge=0/0, ticks=59091104/0, in_queue=59091104, util=99.51%
  nvme12n1: ios=67137057/0, merge=0/0, ticks=18685135/0, in_queue=18685136, util=99.60%
  nvme13n1: ios=67137217/0, merge=0/0, ticks=18638940/0, in_queue=18638940, util=99.76%
  nvme14n1: ios=67137341/0, merge=0/0, ticks=18663275/0, in_queue=18663275, util=99.70%
  nvme15n1: ios=67137620/0, merge=0/0, ticks=18629947/0, in_queue=18629948, util=99.77%
  nvme17n1: ios=67137778/0, merge=0/0, ticks=18709586/0, in_queue=18709585, util=99.80%
  nvme18n1: ios=67137952/0, merge=0/0, ticks=18591798/0, in_queue=18591797, util=99.72%
  nvme19n1: ios=67138199/0, merge=0/0, ticks=18669545/0, in_queue=18669545, util=99.86%
  nvme20n1: ios=67138378/0, merge=0/0, ticks=18600128/0, in_queue=18600128, util=99.89%
  nvme21n1: ios=67138562/0, merge=0/0, ticks=18720763/0, in_queue=18720763, util=100.00%
  nvme22n1: ios=67138772/0, merge=0/0, ticks=18659716/0, in_queue=18659716, util=100.00%
  nvme23n1: ios=67138982/0, merge=0/0, ticks=27862395/0, in_queue=27862395, util=100.00%
  nvme24n1: ios=67134934/0, merge=0/0, ticks=241977879/0, in_queue=241977879, util=100.00%
  md0: ios=75701982/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md1: ios=64175011/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

I'm used to tuning interrupts, so here are the interrupts during the hero (raw-drive) portion of the fio run and then during the mdraid portion. Without polling they are just well-balanced IRQs across the different NVMe MQs.

[root@<server> jim]# ./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 532284 CPU146 Function call interrupts
CAL 529615 CPU154 Function call interrupts
CAL 526198 CPU162 Function call interrupts
CAL 524012 CPU142 Function call interrupts
CAL 521467 CPU174 Function call interrupts
CAL 520821 CPU178 Function call interrupts
CAL 518798 CPU176 Function call interrupts
CAL 518244 CPU166 Function call interrupts
CAL 517524 CPU180 Function call interrupts
CAL 514563 CPU136 Function call interrupts
reported top 10 (of 1885) reported interrupts = 5223526 870587.7 per sec 6.8% of all interrupts
^C
[root@<server> jim]# !!
./top-irq.pl -k 1
reporting top 10 every 6 secs subject to thresh=10 kernel=1
CAL 63759 CPU15 Function call interrupts
CAL 63664 CPU178 Function call interrupts
CAL 63428 CPU142 Function call interrupts
CAL 63382 CPU51 Function call interrupts
CAL 63285 CPU140 Function call interrupts
CAL 63068 CPU150 Function call interrupts
CAL 63017 CPU148 Function call interrupts
CAL 62984 CPU144 Function call interrupts
CAL 62842 CPU25 Function call interrupts
CAL 62835 CPU37 Function call interrupts
reported top 10 (of 1885) reported interrupts = 632264 105377.3 per sec 4.0% of all interrupts

Lastly, I can't make md0 and md1 each reach ~2M IOPS at the same time. Sometimes the NUMA 0 array is the faster one, sometimes the NUMA 1 array is; I think there might be some sort of bottleneck or race somewhere. It stays that way until I stop and reassemble the arrays, and then it may switch.
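One thing I still want to check, since the imbalance survives until a reassembly, is where each array's RAID5 kernel threads actually end up running relative to that array's drives. A rough sketch of the sampling I have in mind (assuming the per-array threads show up as md0_raid5/md1_raid5 the way they do in ps on my kernels; the extra group_thread_cnt workers run as kworkers on an unbound raid5 workqueue and aren't captured here):

#!/bin/bash
# Sketch only: while both md fio groups are running, sample which CPU
# (and therefore which socket, since we run NPS1) the md0/md1 RAID5
# threads are on, and then show their CPU affinity.
for i in $(seq 10); do
    date +%T
    ps -eLo pid,psr,pcpu,comm | awk '$4 ~ /^md[01]_raid5$/'   # psr = CPU the thread last ran on
    sleep 1
done
for p in $(pgrep 'md[01]_raid5'); do
    taskset -cp "$p"    # current CPU affinity of each array's thread
done

If one array's thread turns out to be camped on the far socket, that might explain why whichever array "wins" at assembly time stays the fast one.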
I haven't troubleshot it enough to spot the pattern. The socket 0 / socket 1 difference in the hero numbers (16.0M vs. 13.5M IOPS) is something I'll have to take up with HPE; maybe a card is slowing down the drives on socket 1.

Any help is greatly appreciated. Criticism will be accepted, and worst case, IF I HAVEN'T MISSED SOMETHING UTTERLY SILLY, this becomes a de facto "where to start" for base users like me before the kernel-level experts get involved. As an FYI, I have also booted a 5.13 kernel and started using io_uring on a different server with Gen3 drives, and saw no noticeable difference in md performance. I can raise my "hero numbers" when I have time to play, but right now my job is to get protected IOPS.

Jim Finlayson
U.S. Department of Defense