Hi,

Executive summary: On raid10,far2 and raid10,offset2, heavy write loads starve reads to an extreme degree, even (*especially*) when the backing disks are queued by schedulers that are supposed to prioritize reads (deadline, mq-deadline). Read and write IOPS end up at roughly a 1:10 ratio, and throughput shows a similar split. Switching schedulers from cfq to deadline did not make much of a difference, and the issue persists with mq-deadline and, to a lesser extent, with bfq. Benchmarking with fio directly on a newly created far2 and offset2 array confirms that the issue is unrelated to LVM or to filesystems higher up the stack, and that it may not be fixable by simply switching schedulers. Any input, advice, testing suggestions, or scheduler tweaking tips would be appreciated.

This issue goes back to 2016, when I first started using raid10,far2. That array consists of two 3TB Western Digital Red, 5200RPM drives, with a 512k chunk size. I've noticed the issue in two real workload scenarios:

- A KVM / qemu / libvirt VM running Windows 7 directly on a Linux LVM LV (so a raw disk image; the LV is a hard disk partitioned with an EFI system partition, reserved space, and a correctly aligned NTFS partition). While a large (multi-GiB, Steam) download is running, loading game assets at the same time crawls along at speeds measured in KiB/s, whereas normally they load at about 100MiB/s.

- rsyncing a multi-GiB filesystem to an ext4 filesystem on a separate LVM LV, but on the same raid10 array, has the same effect on the loading of game assets; this made me conclude it's not caused by Windows' scheduling or some such.

In both cases, iostat showed disk IOPS pretty much maxed out the whole time.

I recently acquired two new 4TB Western Digital Red, 5200RPM drives, and decided to finally run some benchmarks on a fresh raw raid array, without LVM, Windows, or ext4 in the way. I'm now on a multiqueue kernel, so the selection of schedulers is different, but I'd already noticed the same issue in normal operation. The array is again a 2-disk far2 with a 512k chunk size.

For benchmarking I used fio, running all the normal workload tests with 1, 2, 4, 8, 16, and 32 jobs, with direct=1 to avoid hitting the OS-level page cache.[1] I also cooked up a test with separate job groups for randread and randwrite, which is closest to "real" operation (no magic knowledge within a single job to balance read and write IOPS). All tests were done with mq-deadline and with both 4k and 512k block sizes. Example fio command lines:

# single-group randrw test
fio --name=job1 --blocksize=4k --iodepth=1 --direct=1 --fsync=0 \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60 \
    --group_reporting --output=raid10-4TB-offset2-rw16-4k --rw=rw \
    --filename=/dev/md127

# dual-group mixed-workload test
fio --group_reporting --output=raid10-4TB-far2-rw16-512k-dual \
    --name=readers --blocksize=512k --iodepth=1 --direct=1 --fsync=0 \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60 --rw=randread \
    --filename=/dev/md127 \
    --name=writers --blocksize=512k --iodepth=1 --direct=1 --fsync=0 \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60 \
    --rw=randwrite --filename=/dev/md127

A script for the benchmarks is available at <https://ptpb.pw/2Fdd.txt>. Disable the first loop if you're only interested in the relevant benchmarks, the dual-workload ones. Be careful: this script may write *directly* to the raid ARRAY, corrupting any data on it in the process.
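For anyone who wants to reproduce this on spare disks: I no longer have the exact mdadm invocations, so the following is only a sketch, with example device names, of how the test arrays were set up:

# 512 KiB chunks, far2 layout; use --layout=o2 for the offset2 array
mdadm --create /dev/md127 --level=10 --raid-devices=2 \
    --layout=f2 --chunk=512 /dev/sda /dev/sdb
# verify layout and chunk size
mdadm --detail /dev/md127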
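The scheduler was selected per member disk through the usual sysfs knob, and the read/write skew is also visible live in iostat while the dual-workload jobs run (again, sda/sdb stand in for the member disks):

# list available schedulers and select mq-deadline on both members
cat /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sdb/queue/scheduler
# watch r/s versus w/s on the members and the array during a run
iostat -x sda sdb md127 1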
Results of the benchmarks are available at <https://ptpb.pw/Fk6F.tgz>. The most interesting ones are in the raid10-4TB-offset2-rw16-4k-dual and raid10-4TB-offset2-rw16-512k-dual files.

The benchmarks done on mq-deadline clearly show that IOPS are distributed very unfairly between the two job groups: a read:write ratio of 1:8 to 1:10, sometimes as bad as 1:20, when the blocksize is 4k. With a 512k blocksize it gets closer to 1:4 with 32 writers and 32 readers, but the throughput distribution is still an undesirable 1:4.

Initially I blamed the extra seeking imposed by far2, since every write has to land on both an outer and an inner region of each disk. So I then tested an offset2 layout, also with a 512k chunk size. It has the same problem, with similar throughput characteristics. Logfiles for those runs are included in the aforementioned tgz file.

Testing with the bfq scheduler shows it also has some undesired behaviour, especially with only a few readers/writers and small block sizes. At a 4k blocksize, the 1- and 2-job versions have 1:4 and 1:2 read:write ratios respectively, the 4-job version somehow has a 6:1 ratio, the 8-job version goes bananas with 1:15, and 16 and 32 jobs end up at 1:7 and 7:1. In short: bfq is all over the place when facing a 4k blocksize. At a 512k blocksize it starts at a "meh" ratio of 1:3 with 1 or 2 readers and writers, is suddenly almost 1:1 at 4 jobs, reverts to 1:10 at 8, then does 1:2 at 16 and 1:1 again at 32. Again, bfq is all over the place, but it seems better than mq-deadline.

This feels like it shouldn't be the default behaviour. Personally, it makes my machine pretty much unusable whenever heavy writing is going on. I'd very much like to simply throttle write throughput so that read throughput stays on par with or above it. It's not even so much about the latency of the requests that do get served as it is about the *number* or *size* of read requests served.

Considering offset2 doesn't really improve matters in the benchmarks, I'll probably stick with far2, so scheduler tweaking is my current plan. Problem is, I have no clue what to tweak to improve this; the only candidates I've found so far are the knobs sketched in [2], and I don't know whether those are even the right ones. As mentioned at the start, any input, advice, testing suggestions, or scheduler tweaking tips would be appreciated.

Regards,

Pol Van Aubel

[1] As an aside, fio claims unrealistic write speeds for these disks when testing sequential writes, but those are not the important benchmarks here. I don't think it's due to the disk cache, which is only 256MiB per disk. I mention it only for completeness.
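[2] These are untested guesses on my part, and the values and device names are only examples: the mq-deadline sysfs tunables on each member disk look like the obvious place to make reads expire sooner and let writes wait longer, and cgroup v2 io.max might allow throttling the writer outright, although I'm not sure how cleanly that applies to an md device.

# favour reads: expire them sooner, let writes wait longer, and allow
# more read batches before a starved write is forced through
echo 100   > /sys/block/sda/queue/iosched/read_expire
echo 10000 > /sys/block/sda/queue/iosched/write_expire
echo 8     > /sys/block/sda/queue/iosched/writes_starved

# alternatively, cap write bandwidth for a cgroup containing the writing
# process; 9:127 is the major:minor of /dev/md127 here, "writers" is an
# example cgroup with the io controller enabled, and 50 MB/s is arbitrary
echo "9:127 wbps=52428800" > /sys/fs/cgroup/writers/io.max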