Question for a kernel developer "in the know" - RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All,
I have a question, and it's a silly one, but I have to share.   I'm well past the age where I'd dive into the kernel code and diff it myself, but I have a question for those "in the know".

We switched from the 5.14rc4 kernel and rc6 (I believe) to the stock 5.14.0 kernel, and my >84GB/s sustained bandwidth is now hovering at 100GB/s with no changes whatsoever by me in any tuning of the mdraid or the SSDs.   My question: when there are "rc" kernels, is there extra debugging, tracing, or any other built-in "inefficiency" in the release candidates that gets configured out of the released kernels?
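
If it helps anyone answer, the comparison I have in mind is just a config diff along these lines (a minimal sketch - the paths are only examples of where a build config might live, and the option list is a guess at the usual suspects):

    # Hypothetical paths; adjust to wherever the build configs actually live
    diff <(grep -E '^CONFIG_(DEBUG|TRACE|LOCKDEP|KASAN|PROVE)' /boot/config-5.14.0-rc6) \
         <(grep -E '^CONFIG_(DEBUG|TRACE|LOCKDEP|KASAN|PROVE)' /boot/config-5.14.0)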

I'm puzzled by the performance gain of >20% when I did nothing that I can take credit for.   I run the same fio configs, the same sys admin built both the release candidates and the released kernel, and I haven't done any more BIOS tweaking.   My lab was powered off for two weeks for machine room cooling upgrades, so my machine room is cooler, but there is nothing I can point to.    If some "inefficiencies" weren't removed in the move to the release kernel, I can only conclude that "the more I learn about SSDs, the less I actually understand about SSDs".

I still owe a 5.15rc1 or later test because there looked to be a patch there that would impact the number of I/Os that mdraid can queue up, so I expect more of a performance gain there.   If that patch, "[PATCH] blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues", somehow snuck into the 5.14.0 kernel, then it will all make sense. 
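
If anyone has a kernel tree handy, that check itself should be quick - a minimal sketch, assuming a local clone with the upstream tags fetched:

    # List anything mentioning the macro that landed between 5.14-rc6 and the
    # 5.14 release (empty output would mean the patch did not sneak in)
    git log --oneline v5.14-rc6..v5.14 --grep="BLK_MAX_REQUEST_COUNT"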

Regards,
Jim


-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> 
Sent: Tuesday, August 17, 2021 5:21 PM
To: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

All,
A quick random-read performance update (this is my best shot at "going for it" with all of the guidance from this list) - I'm thrilled.....

5.14rc4 kernel, Gen 4 drives, all AMD Rome BIOS tuning to keep I/O from power throttling, SMT turned on (off yielded higher performance but left no room for anything else).   Each 15.36TB drive is cut into 32 equal partitions; 32 NUMA-aligned RAID5 9+1s, each built from the same partition index on NUMA0, are combined with an LVM volume concatenating all 32 RAID5s into one.    I then do the exact same thing on NUMA1.
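
Roughly, each per-NUMA-node volume is assembled along these lines (illustrative sketch only - device names, member counts, and chunk size are placeholders, not my actual layout):

    # One RAID5 (9 data + 1 parity) per partition index, then a linear LV
    # concatenating all 32 arrays for NUMA node 0
    for p in $(seq 1 32); do
        mdadm --create /dev/md/numa0_r5_$p --level=5 --raid-devices=10 \
              --chunk=128 /dev/nvme{0..9}n1p$p
    done
    pvcreate /dev/md/numa0_r5_{1..32}
    vgcreate vg_numa0 /dev/md/numa0_r5_{1..32}
    lvcreate -n lv_numa0 -l 100%FREE vg_numa0    # default linear LV = concatenation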

4K random reads, SMT off: sustained bandwidth > 90GB/s, sustained IOPS across both LVMs ~23M. Bad part: only 7% of the system is left to do anything useful.
4K random reads, SMT on: sustained bandwidth > 84GB/s, sustained IOPS across both LVMs ~21M, 46.7% idle (0.73% user, 52.6% system time).
Takeaway - IMHO, no reason to turn off SMT; it helps way more than it hurts...
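
The fio side of those runs is nothing special - a rough single-socket equivalent, with the LV path, job count, and runtime below as placeholders:

    # Illustrative only: NUMA-pinned 4K random reads against the NUMA0 LVM volume
    numactl --cpunodebind=0 --membind=0 fio --name=socket0-lv \
        --filename=/dev/vg_numa0/lv_numa0 --rw=randread --bs=4k --direct=1 \
        --ioengine=libaio --iodepth=128 --numjobs=64 --group_reporting \
        --time_based --runtime=600 --norandommap --randrepeat=0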

Without the partitioning and LVM shenanigans, with SMT on, 5.14rc4 kernel, most AMD BIOS tuning (not all), I'm at 46GB/s, 11.7M IOPS, 42.2% idle (3% user, 54.7% system time)

With stock RHEL 8.4, 4.18 kernel, SMT on, both partitioning and LVM shenanigans, most AMD BIOS tuning (not all), I'm at 81.5GB/s, 20.4M IOPS, 49% idle (5.5% user, 46.75% system time)

The question I have for the list: given my large drive sizes, it takes me a day to set up and build an mdraid/LVM configuration.    Has anybody found the "sweet spot" for how many partitions per drive?    I now have a script to generate the drive partitions, a script for building the mdraid volumes, and a procedure for unwinding from all of this and starting again.    

If anybody knows the point of diminishing returns for the number of partitions per drive, it would save me a few days of letting 32 run for a day, then reconfiguring for 16, 8, 4, 2, 1....   I could just tear apart my LVMs and remake them with half as many RAID partitions, but depending upon how the NVMe drive is "RAINed" across NAND chips, I might leave performance on the table.   The researcher in me says: start over, don't make ANY assumptions.
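
For the record, the partition-generation script is nothing exotic either - a rough sketch with parted, where the device name and partition count are placeholders:

    # Illustrative only: split one NVMe namespace into N equal GPT partitions
    DEV=/dev/nvme0n1; N=32
    TOTAL_MIB=$(( $(blockdev --getsize64 "$DEV") / 1048576 - 2 ))   # leave headroom for GPT metadata
    STEP=$(( TOTAL_MIB / N ))
    parted -s "$DEV" mklabel gpt
    for i in $(seq 0 $(( N - 1 ))); do
        parted -s "$DEV" mkpart "md$i" "$(( 1 + i * STEP ))MiB" "$(( 1 + (i + 1) * STEP ))MiB"
    done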

As an aside, on this server I sustain around 1.1M NUMA-aware IOPS per drive when hitting all 24 drives individually without RAID, so I'm thrilled with the performance ceiling of the RAID; I just have to find a way to make it something somebody would be willing to maintain.   Somewhere there is a sweet spot between sustainability and performance.   Once I find it, I have to figure out whether there is something useful to do with this new toy.....


Regards,
Jim




-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Sent: Monday, August 9, 2021 3:02 PM
To: 'Gal Ofri' <gal.ofri@xxxxxxxxxxx>; 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sequential Performance:
BLUF: 1M sequential, direct I/O reads, QD 128 - 85GiB/s across both 10+1+1 NUMA-aware 128K-striped LUNs.   Still saw the imbalance between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could still be drifting power management on the AMD Rome cores.    I tried a 1280K block size to get a full-stripe read, but Linux seems unfriendly to non-power-of-2 block sizes.... performance decreased considerably (20GiB/s?) with the 10x128KB block size....   I think I ran for about 40 minutes with the 1M reads...


socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128 ...
fio-3.26
Starting 128 processes

fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Cc: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@xxxxxxxx> wrote:

> BLUF, with the 5.14rc3 kernel that our SA built: md0, a 10+1+1
> RAID5 - 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5 - 5.892M IOPS, 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS.   I think the kernel patch is good.  Prior was socket0 1.263M IOPS, 4934MiB/s; socket1 1.071M IOPS, 4183MiB/s....   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
That's great!
Thanks for sharing your results.
I'd appreciate it if you could run a sequential-reads workload (128k/256k) so that we get a better sense of the throughput potential here.

> With my strict NUMA adherence in mdraid, I see lots of variability between reboots/assembles.    Sometimes md0 wins, sometimes md1 wins, and in my earlier runs md0 and md1 are notionally balanced.   I change nothing but see this variance.   I just cranked up a week-long extended run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is doing 5M IOPS and md1 6.3M 
Given my humble experience with the code in question, I suspect that it is not really optimized for NUMA awareness, so your findings seem quite reasonable to me. I don't really have a good tip for that.

I'm focusing now on thin-provisioned logical volumes (LVM - it actually has a much worse read bottleneck), but we have plans to research
md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal



