Final spray from me for a few days. Even with my strict NUMA adherence with mdraid, I see lots of variability between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 were notionally balanced. I change nothing, yet I see this variance. I just cranked up a week-long extended run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1 6.3M - still totaling in the low 11M range, but quite the disparity. Am I missing a tuning knob? I shared everything I do and know in the earlier thread; I just want to point this out while I have the list's attention, because I have seen this behavior over and over again.

The more I learn about AMD, the more I think I can't depend on the provided HPC profile and need to take full control of the BIOS. I know there is a ton of power management going on under the covers, so maybe that is what I'm experiencing. The more I type, the more I think I don't see this on Intel, but I don't have a modern Intel machine with modern SSDs to test. I'll accept that there is nothing inherent in mdraid or the kernel to cause this and put my attention on the BIOS if the experts can confirm....

-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Sent: Thursday, August 5, 2021 4:50 PM
To: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: 'Gal Ofri' <gal.ofri@xxxxxxxxxxx>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

As far as the slower hero numbers - false alarm on my part. Rebooted with the 4.18 RHEL 8.4 kernel: socket0 hero 13.2M IOPS, socket1 hero 13.7M IOPS. I still have to figure out the differences, either between my drives or my servers. If I had to guess, slot for slot the PCIe cards are in different slots between the two servers....

As a major flag, though: with mdraid volumes I created under the 5.14rc3 kernel, I lock the system up solid when I try to access them under 4.18. I'm not an expert at forcing NMIs and getting the stack traces, so I might have to leave that to others. After two lockups, I returned to the 5.14 kernel. If I need to run something - you have seen the config I have - I'm willing. I'm willing to push as hard as I can and to run anything that can help, as long as it isn't urgent; I have a day job and some constraints as a civil servant, but I have the researcher's push-push-push mindset.

I really want to encourage the community to push as hard as possible on protected IOPS, and I'm willing to help however I can. In my interactions with the processor and server OEMs, I'm encouraging them to get their biggest, baddest server/SSD combinations to the Linux I/O development leaders early in development. I know they won't listen to me, but I'm trying to help. For those of you on Rome servers, get with your server provider - there are some things in the BIOS that can be tweaked for I/O.
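
(One low-tech way to capture traces for the 4.18 lockups mentioned above, if the box still responds at all - a sketch assuming magic sysrq is built into the kernel, ideally run over a serial/BMC console:)

echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions
echo w > /proc/sysrq-trigger       # dump blocked (D-state) tasks to the kernel log
echo l > /proc/sysrq-trigger       # backtraces of all active CPUs
echo t > /proc/sysrq-trigger       # backtraces of every task (very verbose)
dmesg > /tmp/lockup-traces.txt     # collect the output to share with the list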
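
(And on the md0/md1 variability and power-management question in the newest message at the top: a quick sketch for ruling out the obvious suspects. The boost path assumes the acpi-cpufreq driver, and the md knobs listed mostly matter for writes, but they are cheap to confirm identical between the two arrays:)

# NUMA node of each NVMe controller (PCI attribute)
for c in /sys/class/nvme/nvme*; do
    echo "$(basename "$c"): node $(cat "$c"/device/numa_node)"
done

# CPU frequency governor(s) in use, and whether boost is enabled (acpi-cpufreq)
sort -u /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
cat /sys/devices/system/cpu/cpufreq/boost

# md knobs that should at least match between md0 and md1
grep . /sys/block/md[01]/md/group_thread_cnt /sys/block/md[01]/md/stripe_cache_size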
-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Sent: Thursday, August 5, 2021 3:52 PM
To: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: 'Gal Ofri' <gal.ofri@xxxxxxxxxxx>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sorry - again, I sent HTML instead of plain text. Resend after a mailing-list bounce.

All,
Sorry for the delay - both work and life got in the way. Here is some feedback:

BLUF: with the 5.14rc3 kernel that our SA built - md0, a 10+1+1 RAID5: 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5: 5.892M IOPS, 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS. I think the kernel patch is good. Prior was socket0 1.263M IOPS, 4934MiB/s and socket1 1.071M IOPS, 4183MiB/s.... I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.

I need to verify the raw IOPS - admittedly this is a different server and I didn't do any regression testing before the kernel change, but my raw numbers were socket0 13.2M IOPS and socket1 13.5M IOPS. Prior was socket0 16.0M IOPS and socket1 13.5M IOPS. There appears to be a regression in the socket0 "hero run", but since this is a different server, I don't know whether I have a configuration management issue from my zealousness to test this patch or whether we have a real regression. I was so excited to have the attention of kernel developers who needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32-partition mdraid/LVM mess. If I can switch kernels and reboot before work and life get back in the way, I'll follow up. I think I might have to give myself the action to run this to ground next week on the other server. Without a doubt the mdraid lock improvement is worth taking forward; I either have to find my error or point a finger, as my raw hero numbers got worse.

I tend to see one socket outrun the other - the way HPE allocates the NVMe drives to PCIe root complexes is not how I'd do it, so the drives are unbalanced across the root complexes (drives sit on 4 different root complexes on socket 0 and 3 on socket 1), so one would think socket0 would always be faster for hero runs. An NPS4 NUMA mapping is the best way to show it:

[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6
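
(For reference: the actual fiojim.hpdl385.nps1 job file behind the run below is not included in this thread. A rough command-line sketch of what one of these per-socket raw-device groups might look like is shown here - the device list and NUMA node are illustrative only, and numa_cpu_nodes/numa_mem_policy require fio built with libnuma:)

# Hedged sketch, not the actual job file: 64 jobs of 4KiB random reads,
# queue depth 128, pinned to NUMA node 0 and its local drives.
fio --name=socket0 --group_reporting --time_based --runtime=300 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k --iodepth=128 \
    --numjobs=64 --norandommap --randrepeat=0 \
    --numa_cpu_nodes=0 --numa_mem_policy=bind:0 \
    --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1

(The -md groups would point --filename at /dev/md0 or /dev/md1 instead of the raw devices.)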
fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug 5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug 5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug 5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug 5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%

From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>; 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

A recent commit raised the limit on raid5/6 read iops. It's available in 5.14.
See https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e

commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@xxxxxxxx>
Date:   Mon Jun 7 14:07:03 2021 +0300

    md/raid5: avoid device_lock in read_one_chunk()

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal,
Volumez (formerly storing.io)
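
(For anyone checking whether a given kernel already carries this change, a quick sketch against a clone of the mainline tree - the v5.14-rc3 tag name below is an assumption following the usual vX.Y-rcN convention:)

# First release tag that contains the commit
git describe --contains 97ae27252f4962d0fcc38ee1d9f913d817a2024e

# Or test a specific tag/branch you plan to build against
git merge-base --is-ancestor 97ae27252f4962d0fcc38ee1d9f913d817a2024e v5.14-rc3 \
    && echo "commit is included"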