All,

A quick random performance update (this is the best I can do in "going for it" with all of the guidance from this list) - I'm thrilled.....

5.14rc4 kernel, Gen 4 drives, all AMD Rome BIOS tuning to keep I/O from power throttling, SMT turned on (off yielded higher performance but left no room for anything else), 15.36TB drives cut into 32 equal partitions, 32 NUMA aligned raid5 9+1s from the same partition on NUMA0 combined with an LVM concatenating all 32 RAID5's into one volume. I then do the exact same thing on NUMA1.

4K random reads, SMT off, sustained bandwidth of > 90GB/s, sustained IOPS across both LVMs, ~23M - bad part, only 7% of the system left to do anything useful

4K random reads, SMT on, sustained bandwidth of > 84GB/s, sustained IOPS across both LVMs, ~21M - 46.7% idle (0.73% user, 52.6% system time)

Takeaway - IMHO, no reason to turn off SMT, it helps way more than it hurts...

Without the partitioning and lvm shenanigans, with SMT on, 5.14rc4 kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS, 42.2% idle (3% user, 54.7% system time)

With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M IOPS, 49% idle (5.5% user, 46.75% system time)

The question I have for the list: given my large drive sizes, it takes me a day to set up and build an mdraid/lvm configuration. Has anybody found the "sweet spot" for how many partitions per drive? I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding from all of this and starting again.
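To make the size of that sweep concrete, here is a minimal Python sketch of the bookkeeping only. It is not the partitioning or mdraid script mentioned above. The function name `sweep_plan` and the dictionary keys are hypothetical, and it assumes decimal units (15.36TB = 15360GB), one RAID5 per partition index per NUMA node as in the layout described, and no allowance for partition-table overhead or alignment.

```python
# Hypothetical bookkeeping sketch: how many RAID5 arrays and how large each
# partition would be for each candidate partition count per drive.
DRIVE_GB = 15360      # 15.36TB drive, decimal units (assumption)
NUMA_NODES = 2        # NUMA0 and NUMA1, as in the layout described
CANDIDATES = [32, 16, 8, 4, 2, 1]  # partitions per drive to try


def sweep_plan(drive_gb=DRIVE_GB, numa_nodes=NUMA_NODES, candidates=CANDIDATES):
    """Return one row per candidate partition count (illustrative only)."""
    plan = []
    for parts in candidates:
        plan.append({
            "partitions_per_drive": parts,
            # Ignores partition-table overhead and alignment padding.
            "partition_gb": drive_gb / parts,
            # One RAID5 per partition index on each NUMA node.
            "raid5_arrays_per_node": parts,
            "raid5_arrays_total": parts * numa_nodes,
        })
    return plan


if __name__ == "__main__":
    for row in sweep_plan():
        print(row)
```

Under these assumptions the 32-partition case works out to 480GB partitions and 64 RAID5 arrays across both nodes, which is the configuration reported above.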
If anybody knows the point of diminishing return for the number of partitions per drive to max out at, it would save me a few days of letting 32 run for a day, reconfiguring for 16, 8, 4, 2, 1.... I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the nvme drive is "RAINed" across NAND chips, I might leave performance on the table. The researcher in me says, start over, don't make ANY assumptions.

As an aside, on the server, I'm maintaining around 1.1M NUMA aware IOPS per drive, when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling with the RAID, I just have to find a way to make it something somebody would be willing to maintain. Somewhere is a sweet spot between sustainability and performance. Once I find that I have to figure out if there is something useful to do with this new toy.....

Regards,
Jim

-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Sent: Monday, August 9, 2021 3:02 PM
To: 'Gal Ofri' <gal.ofri@xxxxxxxxxxx>; 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sequential Performance:

BLUF, 1M sequential, direct I/O reads, QD 128 - 85GiB/s across both 10+1+1 NUMA aware 128K striped LUNS. Had the imbalance between NUMA 0 44.5GiB/s and NUMA 1 39.4GiB/s but still could be drifting power management on the AMD Rome cores. I tried a 1280K blocksize to try to get a full stripe read, but Linux seems so unfriendly to non-power of 2 blocksizes.... performance decreased considerably (20GiB/s ?) with the 10x128KB blocksize.... I think I ran for about 40 minutes with the 1M reads...

socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 128 processes
fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug 9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     | 1.00th=[ 3], 5.00th=[ 5], 10.00th=[ 7], 20.00th=[ 17],
     | 30.00th=[ 106], 40.00th=[ 116], 50.00th=[ 209], 60.00th=[ 226],
     | 70.00th=[ 236], 80.00th=[ 321], 90.00th=[ 351], 95.00th=[ 372],
     | 99.00th=[ 472], 99.50th=[ 481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw ( MiB/s): min= 967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops : min= 929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec) : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec) : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec) : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec) : 2000=0.15%
  cpu : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug 9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     | 1.00th=[ 570], 5.00th=[ 693], 10.00th=[ 2573],
     | 20.00th=[ 21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw ( MiB/s): min= 599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops : min= 568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec) : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec) : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec) : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec) : 2000=0.14%
  cpu : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Cc: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@xxxxxxxx> wrote:

> BLUF upfront with 5.14rc3 kernel that our SA built - md0 a 10+1+1 RAID5 - 5.332 M IOPS 20.3GiB/s, md1 a 10+1+1 RAID5, 5.892M IOPS 22.5GiB/s - best hero numbers I've ever seen on mdraid RAID5 IOPS. I think the kernel patch is good.
> Prior was socket0 1.263M IOPS 4934MiB/s, socket1 1.071M IOPS, 4183MiB/s.... I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.

That's great! Thanks for sharing your results.
I'd appreciate it if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> In my strict numa adherence with mdraid, I see lots of variability between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced. I change nothing but see this variance. I just cranked up a week long extended run of these 10+1+1s under the 5.14rc3 kernel and right now md0 is doing 5M IOPS and md1 6.3M

Given my humble experience with the code in question, I suspect that it is not really optimized for numa awareness, so I find your findings quite reasonable. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (lvm - it has a much worse reads bottleneck actually), but we have plans for researching md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal
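
The before/after figures quoted above can be cross-checked with a little arithmetic. The following is a minimal Python sketch, assuming a 4KiB block size for the random-read runs (the quoted text gives IOPS and bandwidth but not the block size) and binary units for GiB/s and MiB/s. The helper names `bandwidth_gib_s` and `speedup` are hypothetical and exist only for this illustration.

```python
# Sanity check of the quoted md RAID5 numbers: IOPS x block size should
# reproduce the reported bandwidth, and new/old IOPS gives the speedup.
KIB = 1024
BLOCK_BYTES = 4 * KIB  # assumption: 4KiB random reads


def bandwidth_gib_s(iops, block_bytes=BLOCK_BYTES):
    """Bandwidth in GiB/s implied by an IOPS figure at a fixed block size."""
    return iops * block_bytes / 2**30


def speedup(new_iops, old_iops):
    """Ratio of patched-kernel IOPS to the earlier IOPS."""
    return new_iops / old_iops


# (patched 5.14rc3 IOPS, prior IOPS) per array, taken from the quoted text.
quoted = {
    "md0": (5.332e6, 1.263e6),
    "md1": (5.892e6, 1.071e6),
}

if __name__ == "__main__":
    for name, (new, old) in quoted.items():
        print(f"{name}: {bandwidth_gib_s(new):.1f} GiB/s, "
              f"{speedup(new, old):.1f}x over prior")
```

With the 4KiB assumption the implied bandwidths match the quoted 20.3GiB/s and 22.5GiB/s, and the IOPS ratios come out to roughly 4.2x for md0 and 5.5x for md1.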