Sequential performance, BLUF: 1M sequential direct-I/O reads at QD 128 deliver roughly 85GiB/s across both 10+1+1 NUMA-aware 128K-striped LUNs. There is still an imbalance between NUMA node 0 (44.5GiB/s) and NUMA node 1 (39.4GiB/s), but that could be drifting power management on the AMD Rome cores. I tried a 1280K blocksize (10 x 128K) to get full-stripe reads, but Linux seems unfriendly to non-power-of-2 blocksizes: performance decreased considerably (roughly 20GiB/s?) with the 10x128KB blocksize. I ran for about 40 minutes with the 1M reads.

socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 128 processes
fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug 9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug 9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573], 20.00th=[  21103],
     | 30.00th=[ 102237], 40.00th=[ 143655], 50.00th=[ 204473], 60.00th=[ 231736],
     | 70.00th=[ 283116], 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159], 99.95th=[1166017],
     | 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Sunday, August 8, 2021 10:44 AM
To:
Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Cc: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@xxxxxxxx> wrote:

> BLUF upfront: with the 5.14rc3 kernel that our SA built - md0, a 10+1+1
> RAID5, at 5.332M IOPS, 20.3GiB/s, and md1, a 10+1+1 RAID5, at 5.892M
> IOPS, 22.5GiB/s - these are the best hero numbers I've ever seen on
> mdraid RAID5 IOPS. I think the kernel patch is good. Prior was socket0
> 1.263M IOPS, 4934MiB/s and socket1 1.071M IOPS, 4183MiB/s. I'm willing
> to help push this as hard as we can until we hit a bottleneck outside of
> our control.

That's great! Thanks for sharing your results.
I'd appreciate it if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> In my strict numa adherence with mdraid, I see lots of variability
> between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and
> in my earlier runs md0 and md1 are notionally balanced. I change nothing
> but see this variance. I just cranked up a week-long extended run of
> these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M
> IOPS and md1 6.3M.

Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (lvm - it actually has a much worse reads bottleneck), but we have plans to research md/raid5 again soon to improve write workloads. I'll ping you when I have a patch that might be relevant.

Cheers,
Gal
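[Editor's note: for readers trying to reproduce the sequential run reported above, a job file along the following lines would produce the same two-group "socket0-md"/"socket1-md" layout (64 jobs per socket, 128 processes total). This is a reconstruction from the fio output, not the author's actual file; the /dev/md0 and /dev/md1 paths, the runtime, and the numa_* pinning options are assumptions - fio's NUMA options require it to be built with libnuma support.]

```ini
; sketch of a per-socket sequential-read job (reconstructed, unverified)
[global]
ioengine=libaio
direct=1
rw=read
bs=1M
iodepth=128
numjobs=64
group_reporting=1
time_based=1
runtime=40m                ; assumed; the report says "about 40 minutes"

[socket0-md]
filename=/dev/md0          ; assumed device name
numa_cpu_nodes=0           ; pin these jobs to socket 0
numa_mem_policy=bind:0

[socket1-md]
new_group=1                ; report as a separate group (g=1)
filename=/dev/md1          ; assumed device name
numa_cpu_nodes=1
numa_mem_policy=bind:1
```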
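[Editor's note: the full-stripe arithmetic behind the 1280K experiment mentioned at the top can be sketched as follows - a minimal illustration using the 10-data-disk, 128K-chunk geometry given in the message, not code from the thread.]

```python
# Full-stripe read size for a 10+1+1 RAID5 with a 128 KiB chunk:
# ten data chunks per stripe (parity excluded).
DATA_DISKS = 10
CHUNK_KIB = 128

full_stripe_kib = DATA_DISKS * CHUNK_KIB
print(full_stripe_kib)            # 1280 KiB - not a power of two

# A 1 MiB (1024 KiB) request never covers a whole stripe, so
# consecutive 1 MiB reads drift in and out of stripe alignment:
for i in range(3):
    offset_kib = i * 1024
    print(offset_kib % full_stripe_kib)   # 0, 1024, 768
```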