>>> On Fri, 30 Jul 2021 16:45:32 +0800, Miao Wang
>>> <shankerwangmiao@xxxxxxxxx> said:

> [...] was also stuck in a similar problem and finally gave
> up. Since it is very difficult to find such environment with
> so many fast nvme drives, I wonder if you have any interest in
> ZFS. [...]

Or Btrfs, or the new 'bcachefs', which is OK for simple
configurations (RAID10-like). But part of the issue here is that MD
RAID is in theory mostly a translation layer like 'loop', yet it also
behaves somewhat like a virtual block device, and weird things happen
as IO requests get reshaped and requeued.

My impression, as I mentioned in a previous message, is that the
critical detail is probably this:

>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 1317510.00 0.00 5270044.00 0.00 0.00 0.00 0.00 0.00 0.31 0.00 411.95 4.00 0.00 0.00 100.40
>> [...]
>> Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
>> nvme0n1 114589.00 0.00 458356.00 0.00 0.00 0.00 0.00 0.00 0.29 0.00 33.54 4.00 0.00 0.01 100.00

> The obvious difference is the factor of 10 in "aqu-sz" and that
> correspond to the factor of 10 in "r/s" and "rkB/s".

That may happen because the test is run directly on the 'md[01]' block
device, which can do odd things. Counterintuitively, a much bigger
'aqu-sz', and thus much better speed, could be achieved by running the
test on a suitable filesystem on top of the 'md[01]' device.

With ZFS, since striping is integrated into the filesystem itself,
there is a good chance that the same could happen, especially on
highly parallel workloads. There is however a big caveat: the test
measures IOPS on 4KiB blocks, and ZFS in COW mode does not work well
with that (especially for writes, but also for reads if compression
and checksumming are enabled, and for RAIDZ), so I think it should be
run with COW disabled, or perhaps on a 'zvol'.
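
As a rough sanity check of those numbers: 'aqu-sz' is the average
queue depth, and by Little's law it is roughly 'r/s' times 'r_await'.
A small Python sketch, using only the figures quoted above (nothing
here is newly measured):

  # Little's law: average queue depth ~= request rate * average wait.
  # The inputs are the iostat figures quoted earlier in this thread.

  def approx_aqu_sz(r_per_s, r_await_ms):
      """Estimate iostat's 'aqu-sz' from 'r/s' and 'r_await' (in ms)."""
      return r_per_s * (r_await_ms / 1000.0)

  # Faster run: ~1.32M reads/s at 0.31 ms -> ~408 (iostat showed 411.95).
  print(approx_aqu_sz(1317510.0, 0.31))

  # Slower run: ~115k reads/s at 0.29 ms -> ~33 (iostat showed 33.54).
  print(approx_aqu_sz(114589.0, 0.29))

In other words the per-request latency is nearly the same in both
runs; the factor of 10 in throughput comes from the factor of 10 in
how many requests are kept in flight, which is why getting a deeper
queue (via a filesystem on top of 'md[01]', or ZFS's own striping)
matters so much here.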