Writes-starving-reads with raid10 on cfq, deadline, mq-deadline and possibly bfq

Hi,

Executive summary:

On raid10,far2 and raid10,offset2, heavy write loads for some reason
starve reads to an extreme degree, even (*especially*) when the backing
disks are queued by schedulers that are supposed to prioritize reads
(deadline, mq-deadline). The read:write IOPS ratio comes out at roughly
1:10, and throughput shows a similar split. Switching schedulers from
cfq to deadline did not seem to make much of a difference, and the issue
persists with mq-deadline and, to a lesser extent, with bfq.
Benchmarking with fio directly on newly created far2 and offset2 arrays
confirms the issue is unrelated to LVM or to the filesystems higher up
the stack, and may not be fixable by simply switching schedulers.
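
For reference, scheduler switching was done per backing disk via sysfs,
roughly as follows (sda/sdb here stand in for the actual array members):

# show the available schedulers per disk; the active one is bracketed
cat /sys/block/sda/queue/scheduler
# switch both members to another scheduler, e.g. mq-deadline or bfq
echo mq-deadline > /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sdb/queue/scheduler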

Any input, advice, testing suggestions, or scheduler tweaking tips
would be appreciated.



This issue goes as far back as 2016, when I first started using
raid10,far2. That array consists of two 3TB Western Digital Red,
5200RPM drives, with 512k chunk size.

I've noticed the issue in two real workload scenarios:
- A KVM / qemu / libvirt VM running Windows 7 directly on a Linux LVM
  LV (so a raw disk image; the LV is a hard disk partitioned with an
  EFI system partition, reserved space, and a correctly aligned NTFS
  partition). When doing a large (multi-GiB, Steam) download, loading
  game assets at the same time slows to a crawl measured in KiB/s,
  whereas normally they load at about 100MiB/s.
- rsyncing a multi-GiB filesystem to an ext4 filesystem on a separate
  LVM LV, but on the same raid10 array, has the same effect on the
  loading of game assets; this made me conclude it's not caused by
  Windows' scheduling or some such.

In both cases, when viewing iostat, disk IOPS was pretty much maxed out
the whole time.
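
For anyone who wants to watch this themselves, something like the
following shows the per-disk request rates and utilization (again,
sda/sdb stand in for the array members):

# extended per-device statistics, refreshed every second
iostat -dxm 1 /dev/sda /dev/sdb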

I recently acquired two new 4TB Western Digital Red, 5200RPM drives,
and decided to finally do some benchmarks on a new, raw RAID array,
without LVM or Windows or ext4 in the way. I'm now on a multiqueue
kernel, so the selection of schedulers is different, but I'd already
noticed the same issue in normal operation. The array is again a
2-disk far2, with 512k chunk size.
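
For completeness, the test array was created roughly like this (the
device names are placeholders for the two new disks; --chunk is in
KiB):

mdadm --create /dev/md127 --level=10 --layout=f2 --chunk=512 \
      --raid-devices=2 /dev/sdc /dev/sdd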

For benchmarking I used fio, running all of the normal workload tests
with 1, 2, 4, 8, 16, and 32 jobs, with direct=1 to prevent hitting the
OS-level file cache.[1] I also cooked up a test with separate job
groups for randread and randwrite, which is closest to "real" operation
(no single job with magic knowledge balancing read and write IOPS).
All tests were done with mq-deadline and both 4k and 512k block sizes.

Example fio commandlines for the tests:
# single-group randrw test
fio --name=job1 --blocksize=4k --iodepth=1 --direct=1 --fsync=0   \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60           \
    --group_reporting --output=raid10-4TB-offset2-rw16-4k --rw=rw \
    --filename=/dev/md127

# dual-group mixed-workload-test
fio --group_reporting --output=raid10-4TB-far2-rw16-512k-dual         \
    --name=readers --blocksize=512k --iodepth=1 --direct=1 --fsync=0  \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60 --rw=randread \
    --filename=/dev/md127                                             \
    --name=writers --blocksize=512k --iodepth=1 --direct=1 --fsync=0  \
    --ioengine=sync --size=2G --numjobs=16 --runtime=60               \
    --rw=randwrite --filename=/dev/md127

A script for the benchmarks is available at <https://ptpb.pw/2Fdd.txt>.
Disable the first loop if you're only interested in the relevant
benchmarks, the dual-workload ones. Be careful, since this script may
write *directly* to the raid ARRAY, corrupting data in the process.

Results of the benchmarks are available at <https://ptpb.pw/Fk6F.tgz>.
The most interesting ones are contained in the
raid10-4TB-offset2-rw16-4k-dual
and
raid10-4TB-offset2-rw16-512k-dual
files.

Benchmarks done on mq-deadline clearly show that the IOPS of these job
groups are distributed very unfairly, with a ratio between 1:8 and
1:10, sometimes as bad as 1:20, when the blocksize is 4k. With a 512k
blocksize it gets closer to 1:4 with 32 writers and 32 readers, but the
throughput distribution is still undesirable, also roughly 1:4.

Initially I blamed the increased seeking imposed by far2, since every
write has to land on both an inner and an outer track. So I then
tested on an offset2 layout, also with 512k chunk size. But it has the
same problem, with similar throughput characteristics. Logfiles for
those runs are also included in the aforementioned tgz file.
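
The offset2 array was created the same way, only with a different
layout argument (again, placeholder device names):

mdadm --create /dev/md127 --level=10 --layout=o2 --chunk=512 \
      --raid-devices=2 /dev/sdc /dev/sdd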


Testing with the bfq scheduler shows that it also has some undesired
behaviour, especially when there are only a few readers / writers with
small block sizes. The 1- and 2-job versions have 1:4 and 1:2
read:write ratios, respectively; the 4-job version somehow has a 6:1
ratio; then the 8-job version goes bananas with 1:15, whereas 16 and 32
come in at 1:7 and 7:1. In short: bfq is all over the place when facing
a 4k blocksize.

On a 512k blocksize it starts out at a mediocre 1:3 ratio with 1 or 2
readers and writers. At 4 jobs it's suddenly almost 1:1, but 8 reverts
to 1:10, 16 is 1:2, and 32 is again 1:1. Again, bfq is all over the
place, but it seems better than mq-deadline.


This feels like it shouldn't be the default behaviour. Personally, it
makes my machine pretty much unusable while heavy writing is going on.
I'd very much like to simply throttle write throughput so that read
throughput stays on par with or better than it. It's not even so much
about the latency of the requests served as it is about the *number* or
*size* of read requests served. Considering offset2 doesn't really
improve matters in the benchmark, I'll probably stick with far2, so
scheduler tweaking is my current plan. The problem is that I have no
clue what to tweak in order to improve this. As mentioned at the start,
any input, advice, testing suggestions, or scheduler tweaking tips
would be appreciated.
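
In case it helps the discussion, these are the mq-deadline knobs I'm
currently staring at, with no idea yet which values make sense. The
numbers below are only guesses at biasing things toward reads; sda
again stands in for a backing disk (and likewise for sdb):

# current mq-deadline tunables for one backing disk
grep . /sys/block/sda/queue/iosched/*

# expire reads sooner (default 500 ms)
echo 100 > /sys/block/sda/queue/iosched/read_expire
# let writes wait longer before they must be served (default 5000 ms)
echo 10000 > /sys/block/sda/queue/iosched/write_expire
# allow more read batches before a write batch is forced (default 2)
echo 8 > /sys/block/sda/queue/iosched/writes_starved
# smaller dispatch batches, trading throughput for latency (default 16)
echo 4 > /sys/block/sda/queue/iosched/fifo_batch

If scheduler tweaking turns out to be a dead end, throttling the
writers from userspace with the cgroup v2 io controller might be an
alternative; a hypothetical "writers" cgroup capped at 50MiB/s of
writes would look something like this (8:32 is a placeholder
major:minor for the device being throttled):

echo "8:32 wbps=52428800" > /sys/fs/cgroup/writers/io.max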

Regards,

Pol Van Aubel



[1] As an aside, fio claims unrealistic write speeds for these disks
when testing sequential writes, but those are not the important
benchmarks here. I don't think it's due to the disk cache, which is
only 256MiB per disk. I mention it only for completeness.



