RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Regarding the slower hero numbers - false alarm on my part. After rebooting into the 4.18 RHEL 8.4 kernel:
Socket0 hero - 13.2M IOPS, Socket1 hero - 13.7M IOPS.   I still have to figure out the differences, either between my drives or my servers.  If I had to guess, slot for slot, the PCIe cards are probably in different slots between the two servers.
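If it helps anyone chasing the same slot-for-slot question, a quick way to compare two boxes is to dump each NVMe controller's PCI address and negotiated link status and diff the output between servers - just a sketch assuming the usual sysfs layout and pciutils, not the exact check I ran:

for c in /sys/class/nvme/nvme*; do
    pci=$(basename "$(readlink -f "$c/device")")                  # PCI address of the controller
    lnk=$(lspci -s "$pci" -vv 2>/dev/null | grep -m1 'LnkSta:')   # negotiated link speed/width (run as root)
    echo "$(basename "$c")  $pci  $lnk"
done | sort -k2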

One major flag, though: with the mdraid volumes I created under the 5.14rc3 kernel, the system locks up solid when I try to access them under 4.18. I'm not an expert at forcing NMIs and collecting stack traces, so I might have to leave that to others. After two lockups, I returned to the 5.14 kernel.   If I need to run something - you have seen the config I have - I'm willing.
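For what it's worth, if the box is still answering over SSH or a serial console during one of those hangs, I believe the usual trick is SysRq rather than a true NMI - something along these lines should dump blocked-task and per-CPU backtraces for whoever wants them (a sketch; I have not captured the lockup this way yet):

echo 1 > /proc/sys/kernel/sysrq      # enable all SysRq functions
echo w > /proc/sysrq-trigger         # dump tasks stuck in uninterruptible (D) state
echo l > /proc/sysrq-trigger         # backtrace of all active CPUs
dmesg | tail -n 300                  # or read the traces off the serial console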

I'm willing to push as hard as I can and to run anything that helps, as long as it isn't urgent - I have a day job and some constraints as a civil servant, but I have the researcher's push-push-push mindset.   I really want to encourage the community to push as hard as possible on protected IOPS, and I'm willing to help however I can. In my interactions with the processor and server OEMs, I'm encouraging them to get the Linux I/O development leaders onto the biggest, baddest server/SSD combinations they have, early in development.   I know they won't listen to me, but I'm trying to help.

For those of you on Rome servers, get with your server provider.   There are some BIOS settings that can be tweaked for I/O.


-----Original Message-----
From: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> 
Sent: Thursday, August 5, 2021 3:52 PM
To: 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Cc: 'Gal Ofri' <gal.ofri@xxxxxxxxxxx>; Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Subject: RE: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Sorry again - I sent HTML instead of plain text.

Resend - mailing list bounce
All,
Sorry for the delay - both work and life got in the way.   Here is some feedback:

BLUF: with the 5.14rc3 kernel our SA built, md0 (a 10+1+1 RAID5) hit 5.332M IOPS at 20.3GiB/s and md1 (another 10+1+1 RAID5) hit 5.892M IOPS at 22.5GiB/s - the best hero numbers I've ever seen for mdraid RAID5 IOPS.   I think the kernel patch is good.  Prior results were socket0 1.263M IOPS / 4934MiB/s and socket1 1.071M IOPS / 4183MiB/s.   I'm willing to help push this as hard as we can until we hit a bottleneck outside of our control.
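To be explicit, 10+1+1 here means 10 data drives plus 1 parity plus 1 spare, i.e. 11 active RAID5 members and one spare per array. A sketch of the creation command (device names are illustrative, not my exact command line):

mdadm --create /dev/md0 --level=5 --raid-devices=11 --spare-devices=1 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 \
      /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1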

I need to verify the raw IOPS - admittedly this is a different server and I didn't do any regression testing before changing kernels, but my raw numbers were socket0 13.2M IOPS and socket1 13.5M IOPS, versus a prior socket0 16.0M IOPS and socket1 13.5M IOPS.   There appears to be a regression in the socket0 "hero run", but since this is a different server I don't know whether I have a configuration management issue born of my zeal to test this patch or whether it is a real regression.   I was so excited to have the attention of kernel developers who needed my help that I borrowed another system, because I didn't want to tear apart my "Frankenstein's monster" 32-partition mdraid/LVM mess.   If I can switch kernels and reboot before work and life get back in the way, I'll follow up.

I think I'll give myself the action to run this to ground next week on the other server.   Without a doubt the mdraid lock improvement is worth taking forward; I either have to find my error or point a finger, since my raw hero numbers got worse.   I tend to see one socket outrun the other - the way HPE allocates the NVMe drives to PCIe root complexes is not how I would do it, so the drives are unbalanced across the root complexes (4 root complexes on socket 0 and 3 on socket 1), and one would therefore expect socket0 to always be faster on hero runs.   An NPS4 NUMA mapping is the best way to show it; a small script for regenerating it on any box follows the listing:
[root@gremlin04 hornet05]# cat *nps4
#filename=/dev/nvme0n1 0
#filename=/dev/nvme1n1 0
#filename=/dev/nvme2n1 1
#filename=/dev/nvme3n1 1
#filename=/dev/nvme4n1 2
#filename=/dev/nvme5n1 2
#filename=/dev/nvme6n1 2
#filename=/dev/nvme7n1 2
#filename=/dev/nvme8n1 3
#filename=/dev/nvme9n1 3
#filename=/dev/nvme10n1 3
#filename=/dev/nvme11n1 3
#filename=/dev/nvme12n1 4
#filename=/dev/nvme13n1 4
#filename=/dev/nvme14n1 4
#filename=/dev/nvme15n1 4
#filename=/dev/nvme17n1 5
#filename=/dev/nvme18n1 5
#filename=/dev/nvme19n1 5
#filename=/dev/nvme20n1 5
#filename=/dev/nvme21n1 6
#filename=/dev/nvme22n1 6
#filename=/dev/nvme23n1 6
#filename=/dev/nvme24n1 6
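For anyone who wants the same view on their own box, something along these lines should regenerate that mapping straight from sysfs (a sketch assuming the standard /sys/block layout):

for ns in /sys/block/nvme*n1; do
    node=$(cat "$ns/device/device/numa_node")     # namespace -> controller -> PCI device -> NUMA node
    echo "#filename=/dev/$(basename "$ns") $node"
done | sort -k2 -n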


fio fiojim.hpdl385.nps1
socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket0-md: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
socket1-md: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128 ...
fio-3.26
Starting 256 processes
Jobs: 128 (f=128): [_(128),r(128)][1.5%][r=42.8GiB/s][r=11.2M IOPS][eta 10h:40m:00s]
socket0: (groupid=0, jobs=64): err= 0: pid=522428: Thu Aug  5 19:33:05 2021
  read: IOPS=13.2M, BW=50.2GiB/s (53.9GB/s)(14.7TiB/300005msec)
    slat (nsec): min=1312, max=8308.1k, avg=2206.72, stdev=1505.92
    clat (usec): min=14, max=42033, avg=619.56, stdev=671.45
     lat (usec): min=19, max=42045, avg=621.83, stdev=671.46
    clat percentiles (usec):
     |  1.00th=[  113],  5.00th=[  149], 10.00th=[  180], 20.00th=[  229],
     | 30.00th=[  273], 40.00th=[  310], 50.00th=[  351], 60.00th=[  408],
     | 70.00th=[  578], 80.00th=[  938], 90.00th=[ 1467], 95.00th=[ 1909],
     | 99.00th=[ 3163], 99.50th=[ 4178], 99.90th=[ 5800], 99.95th=[ 6390],
     | 99.99th=[ 8455]
   bw (  MiB/s): min=28741, max=61365, per=18.56%, avg=51489.80, stdev=82.09, samples=38016
   iops        : min=7357916, max=15709528, avg=13181362.22, stdev=21013.83, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.42%, 250=24.52%, 500=42.21%
  lat (usec)   : 750=7.94%, 1000=6.34%
  lat (msec)   : 2=14.26%, 4=3.74%, 10=0.54%, 20=0.01%, 50=0.01%
  cpu          : usr=14.58%, sys=47.48%, ctx=291912925, majf=0, minf=10492
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=3949519687,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1: (groupid=1, jobs=64): err= 0: pid=522492: Thu Aug  5 19:33:05 2021
  read: IOPS=13.6M, BW=51.8GiB/s (55.7GB/s)(15.2TiB/300004msec)
    slat (nsec): min=1323, max=4335.7k, avg=2242.27, stdev=1608.25
    clat (usec): min=14, max=41341, avg=600.15, stdev=726.62
     lat (usec): min=20, max=41358, avg=602.46, stdev=726.64
    clat percentiles (usec):
     |  1.00th=[  115],  5.00th=[  151], 10.00th=[  184], 20.00th=[  231],
     | 30.00th=[  269], 40.00th=[  306], 50.00th=[  347], 60.00th=[  400],
     | 70.00th=[  506], 80.00th=[  799], 90.00th=[ 1303], 95.00th=[ 1909],
     | 99.00th=[ 3589], 99.50th=[ 4424], 99.90th=[ 7111], 99.95th=[ 7767],
     | 99.99th=[10290]
   bw (  MiB/s): min=28663, max=71847, per=21.11%, avg=53145.09, stdev=111.29, samples=38016
   iops        : min=7337860, max=18392866, avg=13605117.00, stdev=28491.19, samples=38016
  lat (usec)   : 20=0.01%, 50=0.02%, 100=0.36%, 250=24.52%, 500=44.77%
  lat (usec)   : 750=8.90%, 1000=6.37%
  lat (msec)   : 2=10.52%, 4=3.87%, 10=0.66%, 20=0.01%, 50=0.01%
  cpu          : usr=14.86%, sys=49.40%, ctx=282634154, majf=0, minf=10276
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=4076360454,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket0-md: (groupid=2, jobs=64): err= 0: pid=524061: Thu Aug  5 19:33:05 2021
  read: IOPS=5332k, BW=20.3GiB/s (21.8GB/s)(6102GiB/300002msec)
    slat (nsec): min=1633, max=17043k, avg=11123.38, stdev=8694.61
    clat (usec): min=186, max=18705, avg=1524.87, stdev=115.29
     lat (usec): min=200, max=18743, avg=1536.08, stdev=115.90
    clat percentiles (usec):
     |  1.00th=[ 1270],  5.00th=[ 1336], 10.00th=[ 1369], 20.00th=[ 1418],
     | 30.00th=[ 1467], 40.00th=[ 1500], 50.00th=[ 1532], 60.00th=[ 1549],
     | 70.00th=[ 1582], 80.00th=[ 1631], 90.00th=[ 1680], 95.00th=[ 1713],
     | 99.00th=[ 1795], 99.50th=[ 1811], 99.90th=[ 1893], 99.95th=[ 1926],
     | 99.99th=[ 2089]
   bw (  MiB/s): min=19030, max=21969, per=100.00%, avg=20843.43, stdev= 5.35, samples=38272
   iops        : min=4871687, max=5624289, avg=5335900.01, stdev=1370.43, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=5.56%, sys=77.91%, ctx=8118, majf=0, minf=9018
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
    issued rwts: total=1599503201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128
socket1-md: (groupid=3, jobs=64): err= 0: pid=524125: Thu Aug  5 19:33:05 2021
  read: IOPS=5892k, BW=22.5GiB/s (24.1GB/s)(6743GiB/300002msec)
    slat (nsec): min=1663, max=1274.1k, avg=9896.09, stdev=7939.50
    clat (usec): min=236, max=11102, avg=1379.86, stdev=148.64
     lat (usec): min=239, max=11110, avg=1389.84, stdev=149.54
    clat percentiles (usec):
     |  1.00th=[ 1106],  5.00th=[ 1172], 10.00th=[ 1205], 20.00th=[ 1254],
     | 30.00th=[ 1287], 40.00th=[ 1336], 50.00th=[ 1369], 60.00th=[ 1401],
     | 70.00th=[ 1434], 80.00th=[ 1500], 90.00th=[ 1582], 95.00th=[ 1663],
     | 99.00th=[ 1811], 99.50th=[ 1860], 99.90th=[ 1942], 99.95th=[ 1958],
     | 99.99th=[ 2040]
   bw (  MiB/s): min=20982, max=24535, per=-82.15%, avg=23034.61, stdev=15.46, samples=38272
   iops        : min=5371404, max=6281119, avg=5896843.14, stdev=3958.21, samples=38272
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=99.97%, 4=0.02%, 10=0.01%, 20=0.01%
  cpu          : usr=6.55%, sys=74.98%, ctx=9833, majf=0, minf=8956
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1767618924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=50.2GiB/s (53.9GB/s), 50.2GiB/s-50.2GiB/s (53.9GB/s-53.9GB/s), io=14.7TiB (16.2TB), run=300005-300005msec

Run status group 1 (all jobs):
   READ: bw=51.8GiB/s (55.7GB/s), 51.8GiB/s-51.8GiB/s (55.7GB/s-55.7GB/s), io=15.2TiB (16.7TB), run=300004-300004msec

Run status group 2 (all jobs):
   READ: bw=20.3GiB/s (21.8GB/s), 20.3GiB/s-20.3GiB/s (21.8GB/s-21.8GB/s), io=6102GiB (6552GB), run=300002-300002msec

Run status group 3 (all jobs):
   READ: bw=22.5GiB/s (24.1GB/s), 22.5GiB/s-22.5GiB/s (24.1GB/s-24.1GB/s), io=6743GiB (7240GB), run=300002-300002msec

Disk stats (read/write):
  nvme0n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme1n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme2n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme7n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme8n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme9n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme10n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme11n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme12n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme13n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme14n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme15n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme17n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme18n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme19n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme20n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme21n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme22n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme23n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  nvme24n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  md0: ios=1599378656/0, merge=0/0, ticks=391992721/0, in_queue=391992721, util=100.00%
  md1: ios=1767484212/0, merge=0/0, ticks=427666887/0, in_queue=427666887, util=100.00%
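In case anyone wants to reproduce the shape of this run: fiojim.hpdl385.nps1 is a four-group fio job - one group of raw-device jobs per socket plus one group per md device, 64 jobs each, 4KiB random reads at iodepth 128 for 300 seconds. A stripped-down sketch of the layout (filenames and CPU ranges are illustrative, not my exact file):

cat > fiojim.sketch.fio <<'EOF'
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=128
runtime=300
time_based=1
group_reporting=1

[socket0]
numjobs=64
cpus_allowed=0-63                                  # pin to socket 0 (illustrative range)
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1    # colon-separated list of the socket-0 drives

[socket1]
new_group=1
numjobs=64
cpus_allowed=64-127                                # pin to socket 1 (illustrative range)
filename=/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1

[socket0-md]
new_group=1
numjobs=64
cpus_allowed=0-63
filename=/dev/md0

[socket1-md]
new_group=1
numjobs=64
cpus_allowed=64-127
filename=/dev/md1
EOF
fio fiojim.sketch.fio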

From: Gal Ofri <gal.ofri@xxxxxxxxxxx>
Sent: Wednesday, July 28, 2021 5:43 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>; 'linux-raid@xxxxxxxxxxxxxxx' <linux-raid@xxxxxxxxxxxxxxx>
Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????


A recent commit raised the limit on raid5/6 read iops.
It's available in 5.14.
See https://github.com/torvalds/linux/commit/97ae27252f4962d0fcc38ee1d9f913d817a2024e

commit 97ae27252f4962d0fcc38ee1d9f913d817a2024e
Author: Gal Ofri <gal.ofri@xxxxxxxxxx>
Date:   Mon Jun 7 14:07:03 2021 +0300
    md/raid5: avoid device_lock in read_one_chunk()

Please do share if you reach more iops in your env than described in the commit.

Cheers,
Gal
Volumez (formerly storing.io)



