>>> On Fri, 30 Jul 2021 16:45:32 +0800, Miao Wang
>>> <shankerwangmiao@xxxxxxxxx> said:

> [...] was also stuck in a similar problem and finally gave
> up. Since it is very difficult to find such environment with
> so many fast nvme drives, I wonder if you have any interest in
> ZFS. [...]

Or Btrfs, or the new 'bcachefs', which is OK for simple
configurations (RAID10-like). But part of the issue here is that MD
RAID is in theory mostly a translation layer like 'loop', yet it also
behaves somewhat like a virtual block device, and weird things happen
as IO requests get reshaped and requeued.

My impression, as I mentioned in a previous message, is that the
critical detail is probably this:

>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
>> [...]
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00

> The obvious difference is the factor of 10 in "aqu-sz" and that
> correspond to the factor of 10 in "r/s" and "rkB/s".

That may happen because the test is run directly on the 'md[01]' block
device, which can do odd things. Counterintuitively, a much bigger
'aqu-sz', and thus much better speed, could be achieved by running the
test on a suitable filesystem on top of the 'md[01]' device.

With ZFS, since striping is integrated into the filesystem itself,
there is a good chance that the same could happen, especially on
highly parallel workloads. There is however a big caveat: the test
measures IOPS on 4KiB blocks, and ZFS in COW mode does not work well
with that (especially for writes, but also for reads if compression
and checksumming are enabled, and for RAIDZ), so I think it should be
run with COW disabled, or perhaps on a 'zvol'.
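
As a rough sanity check of those numbers: 'aqu-sz' is the average
queue depth, and by Little's law it is roughly 'r/s' times 'r_await'.
A small Python sketch, using only the figures quoted above (nothing
here is newly measured):

  # Little's law: average queue depth ~= request rate * average wait.
  # The inputs are the iostat figures quoted earlier in this thread.

  def approx_aqu_sz(r_per_s, r_await_ms):
      """Estimate iostat's 'aqu-sz' from 'r/s' and 'r_await' (in ms)."""
      return r_per_s * (r_await_ms / 1000.0)

  # Faster run: ~1.32M reads/s at 0.31 ms -> ~408 (iostat showed 411.95).
  print(approx_aqu_sz(1317510.0, 0.31))

  # Slower run: ~115k reads/s at 0.29 ms -> ~33 (iostat showed 33.54).
  print(approx_aqu_sz(114589.0, 0.29))

In other words the per-request latency is nearly the same in both
runs; the factor of 10 in throughput comes from the factor of 10 in
how many requests are kept in flight, which is why getting a deeper
queue (via a filesystem on top of 'md[01]', or ZFS's own striping)
matters so much here.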