RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sequential Performance:
BLUF: 1M sequential direct I/O reads at QD 128 - 85GiB/s aggregate across both 10+1+1 NUMA-aware 128K-striped LUNs. There is still an imbalance between NUMA 0 (44.5GiB/s) and NUMA 1 (39.4GiB/s), but that could be drifting power management on the AMD Rome cores. I also tried a 1280K block size to get full-stripe reads (10 x 128KB), but Linux seems unfriendly to non-power-of-2 block sizes - performance dropped considerably (to roughly 20GiB/s?). The 1M read run went for about 40 minutes.
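The exact job file isn't shown above; a minimal sketch consistent with the output below (device paths and the NUMA options are assumptions - the NUMA options require fio built with libnuma) might look like:

    [global]
    ioengine=libaio
    direct=1
    rw=read
    bs=1024k
    iodepth=128
    group_reporting

    [socket0-md]
    # assumed device path for the NUMA-0 array
    filename=/dev/md0
    numjobs=64
    numa_cpu_nodes=0
    numa_mem_policy=bind:0

    [socket1-md]
    # report the second socket as its own group, matching g=1 below
    new_group
    # assumed device path for the NUMA-1 array
    filename=/dev/md1
    numjobs=64
    numa_cpu_nodes=1
    numa_mem_policy=bind:1

Note that fio itself accepts non-power-of-2 sizes directly (bs=1280k), so the full-stripe variant would be a one-line change to a job file like this.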


socket0-md: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
socket1-md: (g=1): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.26
Starting 128 processes

fio: terminating on signal 2

socket0-md: (groupid=0, jobs=64): err= 0: pid=1645360: Mon Aug  9 18:53:36 2021
  read: IOPS=45.6k, BW=44.5GiB/s (47.8GB/s)(114TiB/2626961msec)
    slat (usec): min=12, max=4463, avg=24.86, stdev=15.58
    clat (usec): min=249, max=1904.8k, avg=179674.12, stdev=138190.51
     lat (usec): min=295, max=1904.8k, avg=179699.07, stdev=138191.00
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    5], 10.00th=[    7], 20.00th=[   17],
     | 30.00th=[  106], 40.00th=[  116], 50.00th=[  209], 60.00th=[  226],
     | 70.00th=[  236], 80.00th=[  321], 90.00th=[  351], 95.00th=[  372],
     | 99.00th=[  472], 99.50th=[  481], 99.90th=[ 1267], 99.95th=[ 1401],
     | 99.99th=[ 1586]
   bw (  MiB/s): min=  967, max=114322, per=8.68%, avg=45897.69, stdev=330.42, samples=333433
   iops        : min=  929, max=114304, avg=45879.39, stdev=330.41, samples=333433
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.05%, 1000=0.06%
  lat (msec)   : 2=0.49%, 4=4.36%, 10=9.43%, 20=7.52%, 50=3.48%
  lat (msec)   : 100=2.70%, 250=47.39%, 500=24.25%, 750=0.09%, 1000=0.01%
  lat (msec)   : 2000=0.15%
  cpu          : usr=0.07%, sys=1.83%, ctx=77483816, majf=0, minf=37747
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=119750623,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=1, jobs=64): err= 0: pid=1645424: Mon Aug  9 18:53:36 2021
  read: IOPS=40.3k, BW=39.4GiB/s (42.3GB/s)(101TiB/2627054msec)
    slat (usec): min=12, max=57137, avg=23.77, stdev=27.80
    clat (usec): min=130, max=1746.1k, avg=203005.37, stdev=158045.10
     lat (usec): min=269, max=1746.1k, avg=203029.23, stdev=158045.27
    clat percentiles (usec):
     |  1.00th=[    570],  5.00th=[    693], 10.00th=[   2573],
     | 20.00th=[  21103], 30.00th=[ 102237], 40.00th=[ 143655],
     | 50.00th=[ 204473], 60.00th=[ 231736], 70.00th=[ 283116],
     | 80.00th=[ 320865], 90.00th=[ 421528], 95.00th=[ 455082],
     | 99.00th=[ 583009], 99.50th=[ 608175], 99.90th=[1061159],
     | 99.95th=[1166017], 99.99th=[1367344]
   bw (  MiB/s): min=  599, max=124821, per=-3.40%, avg=40571.79, stdev=319.36, samples=333904
   iops        : min=  568, max=124809, avg=40554.92, stdev=319.34, samples=333904
  lat (usec)   : 250=0.01%, 500=0.14%, 750=6.31%, 1000=2.60%
  lat (msec)   : 2=0.58%, 4=2.04%, 10=4.17%, 20=3.82%, 50=3.71%
  lat (msec)   : 100=5.91%, 250=32.86%, 500=33.81%, 750=3.81%, 1000=0.10%
  lat (msec)   : 2000=0.14%
  cpu          : usr=0.05%, sys=1.56%, ctx=71342745, majf=0, minf=37766
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=105992570,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=44.5GiB/s (47.8GB/s), 44.5GiB/s-44.5GiB/s (47.8GB/s-47.8GB/s), io=114TiB (126TB), run=2626961-2626961msec

Run status group 1 (all jobs):
   READ: bw=39.4GiB/s (42.3GB/s), 39.4GiB/s-39.4GiB/s (42.3GB/s-42.3GB/s), io=101TiB (111TB), run=2627054-2627054msec

Disk stats (read/write):
    md0: ios=960804546/0, merge=0/0, ticks=18446744072288672424/0, in_queue=18446744072288672424, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
    md1: ios=850399203/0, merge=0/0, ticks=2118156441/0, in_queue=2118156441, util=100.00%, aggrios=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

-----Original Message-----
From: Gal Ofri <gal.ofri@xxxxxxxxxxx> 
Sent: Sunday, August 8, 2021 10:44 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Cc: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: Re: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

On Thu, 5 Aug 2021 21:10:40 +0000
"Finlayson, James M CIV (USA)" <james.m.finlayson4.civ@xxxxxxxx> wrote:

> BLUF up front with the 5.14rc3 kernel that our SA built - md0, a 10+1+1
> RAID5: 5.332M IOPS, 20.3GiB/s; md1, a 10+1+1 RAID5: 5.892M IOPS,
> 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS.
> I think the kernel patch is good. Prior was socket0 at 1.263M IOPS,
> 4934MiB/s and socket1 at 1.071M IOPS, 4183MiB/s. I'm willing to help
> push this as hard as we can until we hit a bottleneck outside of our
> control.
That's great!
Thanks for sharing your results.
I'd appreciate it if you could run a sequential-read workload (128k/256k) so that we get a better sense of the throughput potential here.
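For example, something along these lines (the device path, job count, and runtime here are illustrative, not from the thread):

    fio --name=seqread-128k --filename=/dev/md0 --rw=read --bs=128k \
        --ioengine=libaio --iodepth=64 --direct=1 --numjobs=16 \
        --group_reporting --time_based --runtime=300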

> In my strict NUMA adherence with mdraid, I see lots of variability
> between reboots/assembles. Sometimes md0 wins, sometimes md1 wins, and
> in my earlier runs md0 and md1 were notionally balanced. I change
> nothing but see this variance. I just cranked up a week-long extended
> run of these 10+1+1s under the 5.14rc3 kernel, and right now md0 is
> doing 5M IOPS and md1 6.3M.
Given my humble experience with the code in question, I suspect it is not really optimized for NUMA awareness, so I find your findings quite reasonable. I don't really have a good tip for that.
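One way to sanity-check which socket each NVMe device hangs off of before assembling per-node arrays (sysfs layout can vary slightly by kernel, so treat this as a sketch):

    # print the NUMA node of each NVMe namespace's PCI device (-1 = none reported)
    for d in /sys/block/nvme*n1; do
        printf '%s: %s\n' "${d##*/}" "$(cat "$d/device/device/numa_node")"
    done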

I'm focusing now on thin-provisioned logical volumes (LVM - it actually has a much worse read bottleneck), but we have plans to research md/raid5 again soon to improve write workloads.
I'll ping you when I have a patch that might be relevant.

Cheers,
Gal



