Sorry - again, I sent HTML instead of plain text. Resending after the mailing list bounce.

All,

Sorry for the delay - both work and life got in the way. Here is some feedback:

BLUF: with the 5.14rc3 kernel that our SA built, md0 (a 10+1+1 RAID5) hit 5.332M IOPS / 20.3GiB/s and md1 (a 10+1+1 RAID5) hit 5.892M IOPS / 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS. I think the kernel patch is good. Prior was socket0 1.263M IOPS / 4934MiB/s and socket1 1.071M IOPS / 4183MiB/s. I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.

I need to verify the raw IOPS - admittedly this is a different server and I didn't do any regression testing before changing the kernel. My raw numbers were socket0 13.2M IOPS and socket1 13.5M IOPS; prior was socket0 16.0M IOPS and socket1 13.5M IOPS. There appears to be a regression in the socket0 "hero run", but since this is a different server I don't know whether I have a configuration management issue from my zealousness to test this patch or whether we have a real regression. I was so excited to have the attention of kernel developers who needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32-partition mdraid/LVM mess. If I can switch kernels and reboot before work and life get back in the way, I'll follow up; I think I might have to give myself the action to run this to ground next week on the other server. Without a doubt the mdraid lock improvement is worth taking forward, but I either have to find my error or point a finger, because my raw hero numbers got worse.

I tend to see one socket outrun the other - the way HPE allocates the NVMe drives to PCIe root complexes is not how I'd like to do it, so the drives are unbalanced across the root complexes (they sit on 4 different root complexes on socket 0 but only 3 on socket 1), so one would think socket0 will always be faster for hero runs. An NPS4 NUMA mapping is the best way to show it (a rough sketch of the fio job layout is at the end of this message):

[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6

fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug 5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug 5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug 5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug 5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%

From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>; 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
________________________________________

A recent commit raised the limit on raid5/6 read iops.
It's available in 5.14.
See https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e

commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@xxxxxxxxxx>
Date:   Mon Jun 7 14:07:03 2021 +0300

    md/raid5: avoid device_lock in read_one_chunk()

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal,
Volumez (formerly storing.io)
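P.S. For anyone who wants to try something similar, below is a minimal sketch of a job file that approximates the four groups above (4k randread, libaio, iodepth 128, 64 jobs per group, 300 seconds). It is not the actual fiojim.hpdl385.nps1 file - the file name, CPU ranges, and drive lists are assumptions for illustration; the real run pins each group to the cores of its socket and spreads the raw jobs across all 12 drives per socket per the NPS4 mapping shown earlier.

# sketch.fio - hypothetical approximation of the four reporting groups
[global]
rw=randread
bs=4k
ioengine=libaio
iodepth=128
direct=1
time_based
runtime=300
numjobs=64
group_reporting
cpus_allowed_policy=split

[socket0]
new_group
# assumed socket-0 cores; adjust to the host's actual topology
cpus_allowed=0-63
# socket-0 raw drives (abridged - the real run used all 12 socket-0 drives)
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1

[socket1]
new_group
# assumed socket-1 cores
cpus_allowed=64-127
# socket-1 raw drives (abridged)
filename=/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1

[socket0-md]
new_group
cpus_allowed=0-63
# 10+1+1 RAID5 built from the socket-0 drives
filename=/dev/md0

[socket1-md]
new_group
cpus_allowed=64-127
# 10+1+1 RAID5 built from the socket-1 drives
filename=/dev/md1

Run with "fio sketch.fio"; the main thing to preserve is that each group's 64 jobs stay on the socket local to the drives (or md array) they are hammering.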