RE: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????


 



There is interest in ZFS. We're waiting for the direct I/O patches to settle in OpenZFS, because we couldn't find any way to get around the ARC (every read has to touch the ARC), and ZFS spins an entire CPU core or more deciding which ARC entries to evict. I know who is doing the work; once it settles, I'll see if they are willing to publish to zfs-discuss.
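
A rough way to watch that eviction overhead while a benchmark runs (the pool/dataset name below is made up, and primarycache only reduces how much file data the ARC retains; it does not bypass the ARC):

    # Keep file data out of the ARC for the test dataset (metadata is still cached).
    zfs set primarycache=metadata tank/fio
    # Watch ARC hit/miss rates and eviction pressure while the load runs:
    arcstat 1
    # The arc_evict/arc_prune kernel threads are the ones burning the core described above:
    top -bH -n1 | grep -E 'arc_(evict|prune)'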

-----Original Message-----
From: Miao Wang <shankerwangmiao@xxxxxxxxx> 
Sent: Friday, July 30, 2021 4:46 AM
To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
Cc: Matt Wallis <mattw@xxxxxxxxxxxx>; linux-raid@xxxxxxxxxxxxxxx
Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????

Hi Jim,

Nice to hear about your findings on how to make Linux md work better on fast NVMe drives; I was previously stuck on a similar problem and eventually gave up. Since it is very difficult to find an environment with so many fast NVMe drives, I wonder if you have any interest in ZFS. Maybe you could set up a similar raidz configuration on those drives and see whether its performance is better or worse.
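
If it helps, a comparable layout would be something like the sketch below (device names, ashift, and the raidz level are assumptions, not a recommendation):

    # One raidz1 vdev over the same nine socket-0 drives used in the md tests (names assumed).
    zpool create -o ashift=12 tank raidz1 \
        nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1 nvme6n1 nvme7n1 nvme8n1
    # For a 4K random-read comparison, match the block size and drop atime updates.
    zfs set recordsize=4k tank
    zfs set atime=off tank
    zfs create tank/fio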

Cheers,

Miao Wang

> On 30 Jul 2021, at 16:28, Matt Wallis <mattw@xxxxxxxxxxxx> wrote:
> 
> Hi Jim,
> 
> That’s significantly better than I expected. I need to see if I can get someone to send me the box I was using so I can spend some more time on it.
> 
> Good luck with the rest of it. The next step I was looking at was tweaking stripe widths and the like to see how much difference they make on different workloads.
> 
> Matt. 
> 
>> On 30 Jul 2021, at 08:05, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> wrote:
>> 
>> Matt,
>> Thank you for the tip. I have put 32 partitions on each of my NVMe drives, made 32 RAID5 stripes, and then built an LVM volume from the 32 RAID5 stripes -- one LVM volume per 10 NVMe drives on each socket. I believe I have successfully harnessed "alien technology" (inside joke) to create a Frankenstein's monster. These results are substantially better than the RAID0 stripe over partitioned md's I tried in the past. I whipped all of this together in two 15-minute sessions (last night and just now), so I may run some more extensive tests when I have the cycles. I didn't intend to leave the thread hanging.
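>>
>> For reference, the layout above can be rebuilt with something along these lines (a sketch only; device names, chunk size and VG/LV names are placeholders, not the exact commands used):
>>
>>    # One socket's worth: 10 NVMe drives, 32 partitions each, 32 RAID5 arrays,
>>    # and one striped logical volume on top.
>>    for d in /dev/nvme{0..9}n1; do
>>        parted -s "$d" mklabel gpt
>>        for p in $(seq 1 32); do
>>            parted -s "$d" mkpart "p$p" "$(( (p-1)*100/32 ))%" "$(( p*100/32 ))%"
>>        done
>>    done
>>    for p in $(seq 1 32); do
>>        mdadm --create "/dev/md/r5_$p" --level=5 --raid-devices=10 /dev/nvme{0..9}n1p"$p"
>>    done
>>    pvcreate /dev/md/r5_{1..32}
>>    vgcreate vg_socket0 /dev/md/r5_{1..32}
>>    # -i 32 stripes the LV across all 32 RAID5 PVs; the 64k stripe size is arbitrary.
>>    lvcreate -n lv_socket0 -i 32 -I 64k -l 100%FREE vg_socket0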
>> BLUF (bottom line up front) - fio detailed output below.....
>> 9 drives per socket, raw:
>> socket0, 9 drives, raw 4K random reads: 13.6M IOPS; socket1, 9 drives, raw 4K random reads: 12.3M IOPS
>> %Cpu(s):  4.4 us, 25.6 sy,  0.0 ni, 56.7 id,  0.0 wa, 13.1 hi,  0.2 si,  0.0 st
>> 
>> 9 data drives per socket, RAID5/LVM raw (9+1):
>> socket0, 9 drives, raw 4K random reads: 8.57M IOPS; socket1, 9 drives, raw 4K random reads: 8.57M IOPS
>> %Cpu(s):  7.0 us, 22.3 sy,  0.0 ni, 58.4 id,  0.0 wa, 12.1 hi,  0.2 si,  0.0 st
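>>
>> For context, a run along these lines would produce the four groups reported below (a sketch only; the device list, CPU sets and striped-LV paths are assumptions inferred from the 64 jobs per group):
>>
>>    # Group 0 (raw socket-0 drives).  Groups 1-3 are analogous, with --filename
>>    # pointing at the socket-1 drives or at the two striped logical volumes.
>>    fio --name=socket0 --rw=randread --bs=4k --ioengine=libaio --iodepth=128 \
>>        --direct=1 --numjobs=64 --runtime=60 --time_based --group_reporting \
>>        --cpus_allowed=0-63 --cpus_allowed_policy=split \
>>        --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1   # remaining socket-0 drives omitted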
>> 
>> 
>> All,
>> I intend to test the 4.15 kernel patch next week.   My SA would prefer that the patch had already made it into the kernel-ml stream so he could just install an RPM, but I'll get him to build the kernel if need be.
>> If the 4.15 kernel patch doesn't alleviate the issues, I have a strong desire to have mdraid made better.
>> 
>> 
>> Quick fio results:
>> socket0: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
>> ...
>> socket1: (g=1): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
>> ...
>> socket0-lv: (g=2): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
>> ...
>> socket1-lv: (g=3): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
>> ...
>> fio-3.26
>> Starting 256 processes
>> 
>> socket0: (groupid=0, jobs=64): err= 0: pid=64032: Thu Jul 29 21:48:32 2021
>> read: IOPS=13.6M, BW=51.7GiB/s (55.6GB/s)(3105GiB/60003msec)
>>   slat (nsec): min=1292, max=1376.7k, avg=2545.32, stdev=2696.74
>>   clat (usec): min=36, max=71580, avg=600.68, stdev=361.36
>>    lat (usec): min=38, max=71616, avg=603.30, stdev=361.38
>>   clat percentiles (usec):
>>    |  1.00th=[  169],  5.00th=[  231], 10.00th=[  277], 20.00th=[  347],
>>    | 30.00th=[  404], 40.00th=[  457], 50.00th=[  519], 60.00th=[  594],
>>    | 70.00th=[  676], 80.00th=[  791], 90.00th=[  996], 95.00th=[ 1205],
>>    | 99.00th=[ 1909], 99.50th=[ 2409], 99.90th=[ 3556], 99.95th=[ 3884],
>>    | 99.99th=[ 5538]
>>  bw (  MiB/s): min=39960, max=56660, per=20.94%, avg=53040.69, stdev=49.61, samples=7488
>>  iops        : min=10229946, max=14504941, avg=13578391.80, stdev=12699.61, samples=7488
>> lat (usec)   : 50=0.01%, 100=0.01%, 250=6.83%, 500=40.06%, 750=29.92%
>> lat (usec)   : 1000=13.42%
>> lat (msec)   : 2=8.91%, 4=0.82%, 10=0.04%, 20=0.01%, 50=0.01%
>> lat (msec)   : 100=0.01%
>> cpu          : usr=14.82%, sys=46.57%, ctx=35564249, majf=0, minf=9754
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=813909201,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket1: (groupid=1, jobs=64): err= 0: pid=64096: Thu Jul 29 21:48:32 2021
>> read: IOPS=12.3M, BW=46.9GiB/s (50.3GB/s)(2812GiB/60003msec)
>>   slat (nsec): min=1292, max=1672.2k, avg=2672.21, stdev=2742.06
>>   clat (usec): min=25, max=73526, avg=663.35, stdev=611.06
>>    lat (usec): min=28, max=73545, avg=666.09, stdev=611.08
>>   clat percentiles (usec):
>>    |  1.00th=[  143],  5.00th=[  190], 10.00th=[  227], 20.00th=[  285],
>>    | 30.00th=[  338], 40.00th=[  400], 50.00th=[  478], 60.00th=[  586],
>>    | 70.00th=[  725], 80.00th=[  930], 90.00th=[ 1254], 95.00th=[ 1614],
>>    | 99.00th=[ 3490], 99.50th=[ 4146], 99.90th=[ 6390], 99.95th=[ 6980],
>>    | 99.99th=[ 8356]
>>  bw (  MiB/s): min=28962, max=55326, per=12.70%, avg=48036.10, stdev=96.75, samples=7488
>>  iops        : min=7414327, max=14163615, avg=12297214.98, stdev=24768.82, samples=7488
>> lat (usec)   : 50=0.01%, 100=0.03%, 250=13.75%, 500=38.84%, 750=18.71%
>> lat (usec)   : 1000=11.55%
>> lat (msec)   : 2=14.30%, 4=2.23%, 10=0.60%, 20=0.01%, 50=0.01%
>> lat (msec)   : 100=0.01%
>> cpu          : usr=13.41%, sys=44.44%, ctx=39379711, majf=0, minf=9982
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=737168913,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket0-lv: (groupid=2, jobs=64): err= 0: pid=64166: Thu Jul 29 21:48:32 2021
>> read: IOPS=8570k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>>   slat (nsec): min=1873, max=11085k, avg=4694.47, stdev=4825.39
>>   clat (usec): min=24, max=21739, avg=950.52, stdev=948.83
>>    lat (usec): min=51, max=21743, avg=955.29, stdev=948.88
>>   clat percentiles (usec):
>>    |  1.00th=[  155],  5.00th=[  217], 10.00th=[  265], 20.00th=[  338],
>>    | 30.00th=[  404], 40.00th=[  486], 50.00th=[  594], 60.00th=[  766],
>>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>>    | 99.00th=[ 4490], 99.50th=[ 5669], 99.90th=[ 8586], 99.95th=[ 9896],
>>    | 99.99th=[12125]
>>  bw (  MiB/s): min=24657, max=37453, per=-25.17%, avg=33516.00, stdev=35.32, samples=7616
>>  iops        : min=6312326, max=9588007, avg=8580076.03, stdev=9041.88, samples=7616
>> lat (usec)   : 50=0.01%, 100=0.01%, 250=8.40%, 500=33.21%, 750=17.54%
>> lat (usec)   : 1000=9.89%
>> lat (msec)   : 2=19.93%, 4=9.58%, 10=1.40%, 20=0.05%, 50=0.01%
>> cpu          : usr=9.01%, sys=51.48%, ctx=27829950, majf=0, minf=9028
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=514275323,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> socket1-lv: (groupid=3, jobs=64): err= 0: pid=64230: Thu Jul 29 21:48:32 2021
>> read: IOPS=8571k, BW=32.7GiB/s (35.1GB/s)(1962GiB/60006msec)
>>   slat (nsec): min=1823, max=14362k, avg=4809.30, stdev=4940.42
>>   clat (usec): min=50, max=22856, avg=950.31, stdev=948.13
>>    lat (usec): min=54, max=22860, avg=955.19, stdev=948.19
>>   clat percentiles (usec):
>>    |  1.00th=[  157],  5.00th=[  221], 10.00th=[  269], 20.00th=[  343],
>>    | 30.00th=[  412], 40.00th=[  490], 50.00th=[  603], 60.00th=[  766],
>>    | 70.00th=[ 1029], 80.00th=[ 1418], 90.00th=[ 2089], 95.00th=[ 2737],
>>    | 99.00th=[ 4293], 99.50th=[ 5604], 99.90th=[ 9503], 99.95th=[10683],
>>    | 99.99th=[12649]
>>  bw (  MiB/s): min=23434, max=36909, per=-25.17%, avg=33517.14, stdev=50.36, samples=7616
>>  iops        : min=5999220, max=9448818, avg=8580368.69, stdev=12892.93, samples=7616
>> lat (usec)   : 100=0.01%, 250=7.88%, 500=33.09%, 750=18.09%, 1000=10.02%
>> lat (msec)   : 2=19.91%, 4=9.75%, 10=1.16%, 20=0.08%, 50=0.01%
>> cpu          : usr=9.14%, sys=51.94%, ctx=25524808, majf=0, minf=9037
>> IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>>    submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>    complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
>>    issued rwts: total=514294010,0,0,0 short=0,0,0,0 dropped=0,0,0,0
>>    latency   : target=0, window=0, percentile=100.00%, depth=128
>> 
>> Run status group 0 (all jobs):
>>  READ: bw=51.7GiB/s (55.6GB/s), 51.7GiB/s-51.7GiB/s (55.6GB/s-55.6GB/s), io=3105GiB (3334GB), run=60003-60003msec
>> 
>> Run status group 1 (all jobs):
>>  READ: bw=46.9GiB/s (50.3GB/s), 46.9GiB/s-46.9GiB/s (50.3GB/s-50.3GB/s), io=2812GiB (3019GB), run=60003-60003msec
>> 
>> Run status group 2 (all jobs):
>>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2106GB), run=60006-60006msec
>> 
>> Run status group 3 (all jobs):
>>  READ: bw=32.7GiB/s (35.1GB/s), 32.7GiB/s-32.7GiB/s (35.1GB/s-35.1GB/s), io=1962GiB (2107GB), run=60006-60006msec
>> 
>> Disk stats (read/write):
>> nvme0n1: ios=90336694/0, merge=0/0, ticks=45102163/0, in_queue=45102163, util=97.44%
>> nvme1n1: ios=90337153/0, merge=0/0, ticks=47422886/0, in_queue=47422887, util=97.81%
>> nvme2n1: ios=90337516/0, merge=0/0, ticks=46419782/0, in_queue=46419782, util=97.95%
>> nvme3n1: ios=90337843/0, merge=0/0, ticks=46256374/0, in_queue=46256374, util=97.95%
>> nvme4n1: ios=90337742/0, merge=0/0, ticks=59122226/0, in_queue=59122225, util=98.19%
>> nvme5n1: ios=90338813/0, merge=0/0, ticks=57811758/0, in_queue=57811758, util=98.33%
>> nvme6n1: ios=90339194/0, merge=0/0, ticks=57369337/0, in_queue=57369337, util=98.37%
>> nvme7n1: ios=90339048/0, merge=0/0, ticks=55791076/0, in_queue=55791076, util=98.78%
>> nvme8n1: ios=90340234/0, merge=0/0, ticks=44977001/0, in_queue=44977001, util=99.01%
>> nvme12n1: ios=81819608/0, merge=0/0, ticks=26788080/0, in_queue=26788079, util=99.24%
>> nvme13n1: ios=81819831/0, merge=0/0, ticks=26736682/0, in_queue=26736681, util=99.57%
>> nvme14n1: ios=81820006/0, merge=0/0, ticks=26772951/0, in_queue=26772951, util=99.67%
>> nvme15n1: ios=81820215/0, merge=0/0, ticks=26741532/0, in_queue=26741532, util=99.78%
>> nvme17n1: ios=81819922/0, merge=0/0, ticks=76459192/0, in_queue=76459192, util=99.84%
>> nvme18n1: ios=81820146/0, merge=0/0, ticks=86756309/0, in_queue=86756309, util=99.82%
>> nvme19n1: ios=81820481/0, merge=0/0, ticks=75008919/0, in_queue=75008919, util=100.00%
>> nvme20n1: ios=81819690/0, merge=0/0, ticks=91888274/0, in_queue=91888275, util=100.00%
>> nvme21n1: ios=81821809/0, merge=0/0, ticks=26653056/0, in_queue=26653057, util=100.00%
>>
>> -----Original Message-----
>> From: Matt Wallis <mattw@xxxxxxxxxxxx>
>> Sent: Wednesday, July 28, 2021 8:54 PM
>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
>> Cc: linux-raid@xxxxxxxxxxxxxxx
>> Subject: Re: [Non-DoD Source] Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>> 
>> Hi Jim,
>> 
>> Totally get the Frankenstein’s monster aspect; I try to avoid building those where I can, but at the moment I don’t think there’s much that can be done about it.
>> Not sure that LVM is better than MDRAID 0; it just gives you more control over the volumes that can be created, instead of having it all in one big chunk. If you just need one big chunk, then MDRAID 0 is probably fine.
>> 
>> I think if you can create a couple of scripts that allow the admin to fail a drive out of all the arrays it’s in at once, then it's not much worse than managing MDRAID normally. 
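>>
>> A minimal sketch of such a script (partition naming and array paths are assumptions; it simply tries every array and ignores the ones the drive is not a member of):
>>
>>    #!/bin/bash
>>    # Fail and remove every partition of one physical NVMe drive from every md array,
>>    # so the drive can be swapped in a single step.
>>    set -u
>>    drive="$1"                      # e.g. nvme3n1; assumes partitions named ${drive}p1..pN
>>    for md in /dev/md*; do
>>        [ -b "$md" ] || continue
>>        for part in /dev/"${drive}"p*; do
>>            # mdadm exits non-zero when the partition is not a member of this array; ignore that.
>>            mdadm --manage "$md" --fail "$part" 2>/dev/null || continue
>>            mdadm --manage "$md" --remove "$part" 2>/dev/null
>>        done
>>    done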
>> 
>> Matt.
>> 
>>> On 28 Jul 2021, at 20:43, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> wrote:
>>> 
>>> Matt,
>>> I have put as many as 32 partitions on a drive (based upon great advice from this list) and done RAID6 over them, but I was concerned about long-term sustainability. As a researcher I can do these cool science experiments, but I still have to hand designs off to sustainment folks. I was also running into an issue doing an mdraid RAID0 on top of the RAID6s so that I could put one xfs file system on top of each NUMA node's drives: the final RAID0 stripe over all of the RAID6s couldn't generate the queue depth needed. We even recompiled the kernel to change the mdraid nr_requests max from 128 to 1023.
>>> 
>>> I will have to try the LVM experiment. I'm an LVM neophyte, so it might take me the rest of today/tomorrow to get new results; I tend to let mdraid do all of its volume builds without forcing, so that will take a bit of time as well. One might be able to argue that this configuration isn't too much of a "Frankenstein's monster" for me to hand off.
>>> 
>>> Thanks,
>>> Jim
>>> 
>>> 
>>> 
>>> 
>>> -----Original Message-----
>>> From: Matt Wallis <mattw@xxxxxxxxxxxx>
>>> Sent: Wednesday, July 28, 2021 6:32 AM
>>> To: Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx>
>>> Cc: linux-raid@xxxxxxxxxxxxxxx
>>> Subject: [Non-DoD Source] Re: Can't get RAID5/RAID6 NVMe randomread IOPS - AMD ROME what am I missing?????
>>> 
>>> Hi Jim,
>>> 
>>>> On 28 Jul 2021, at 06:32, Finlayson, James M CIV (USA) <james.m.finlayson4.civ@xxxxxxxx> wrote:
>>>> 
>>>> Sorry, this will be a long email with everything I find to be relevant. I can get over 110GB/s of 4kB random reads from the individual NVMe SSDs, but I'm at a loss as to why mdraid can only do a very small fraction of that. I'm at my "organizational world record" for sustained IOPS, but I need protected IOPS to do something useful. Below is everything I do to a server to make the I/O crank..... My role is that of a lab researcher/resident expert/consultant, and I'm just stumped as to why I can't do better. If there is a fine manual that somebody can point me to, I'm happy to read it…
>>> 
>>> I am probably going to get corrected on some if not all of this, but from what I understand, and from my own little experiments on a similar Intel-based system:
>>> 1. NVMe is stupid fast; you need a good chunk of CPU performance to max it out.
>>> 2. Most block IO in the kernel is limited in terms of threading; it may even be essentially single threaded. (This is where I will get corrected.)
>>> 3. AFAICT this includes mdraid: there's a single thread per RAID device handling all the RAID calculations (mdX_raid6). A quick check for this is sketched below.
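>>>
>>> A quick way to check whether that single md thread is the ceiling while fio runs (assuming the usual mdX_raidN thread naming):
>>>
>>>    # Per-thread CPU usage for the md RAID worker threads; if one of them is pinned
>>>    # near 100% of a core, the array is thread-limited rather than device-limited.
>>>    top -bH -n1 | grep -E 'md[0-9]+_raid[456]'
>>>    # Or sample over a few seconds:
>>>    pidstat -t 5 1 | grep -E 'md[0-9]+_raid'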
>>> 
>>> What I did to get IOPS up in a system with 24 NVMe drives, split into 12 per NUMA domain (a sketch of step 3 follows the list):
>>> 1. Create 8 partitions on each drive (this may be overkill, I just started here for some reason).
>>> 2. Create 8 RAID6 arrays with 1 partition per drive.
>>> 3. Use LVM to create a single striped logical volume over all 8 RAID volumes. RAID 0+6, as it were.
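>>>
>>> For step 3, something along these lines works (a sketch; the md array names and the VG/LV names are placeholders):
>>>
>>>    # Assumes the eight RAID6 arrays from step 2 are /dev/md0 .. /dev/md7.
>>>    pvcreate /dev/md{0..7}
>>>    vgcreate vg_nvme /dev/md{0..7}
>>>    # -i 8 stripes the LV across all eight RAID6 PVs; the 64k stripe size is arbitrary.
>>>    lvcreate -n lv_fast -i 8 -I 64k -l 100%FREE vg_nvme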
>>> 
>>> You now have an LVM thread that is basically doing nothing more than chunking the data as it comes in and sending the chunks to 8 separate RAID devices, each with its own threads, buffers, queues, etc., all of which can be spread over more cores.
>>> 
>>> I saw a significant (for me, significant is >20%) increase in IOPS doing this. 
>>> 
>>> You still have RAID6 protection, but you might want to write a couple of scripts to help you manage the arrays, because a single failed drive now needs to be removed from 8 RAID volumes. 
>>> 
>>> There’s not a lot of capacity lost doing this, pretty sure I lost less than 100MB to the partitions and the RAID overhead.
>>> 
>>> You would never consider this on spinning disk, of course; it's way too slow and you would just make it slower. NVMe, as you noticed, has the IOPS to spare, so I'm pretty sure it's just that we're not able to get the data to it fast enough.
>>> 
>>> Matt




