Re: Long delays for I/Os using CFQ scheduler

Douglas Miller <dougmill@xxxxxxxxxxxxxxxxxx> · Fri, 18 Aug 2017 07:13:27 -0500

On 08/15/2017 10:20 AM, Douglas Miller wrote:
I present a lot of information here, hopefully it is not too much. 
Perhaps this should be put in a bugzilla. I'm not familiar with CFQ, 
or other I/O schedulers, so have been doing more generic debugging.

PROBLEM:

I am observing some undesirable behavior in the CFQ I/O scheduler. I 
am seeing that a subset of the I/Os are experiencing much longer 
delays than the rest. The behavior is not observed when using the 
"deadline" scheduler.

BACKGROUND:

I am running on Ubuntu Zesty using kernel 4.10.0-30, with the commit 
mentioned next. Ubuntu sets the default I/O scheduler to "cfq".

I am running the POWER architecture test suite "HTX" 
(https://github.com/open-power/HTX), and only certain phases of that 
test seem to perturb CFQ. During these phases, I see that a small 
percentage of the total I/Os are getting delayed. Without commit 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5be6b75610cefd1e21b98a218211922c2feb6e08 
I, of course, see very long delays (sometimes over an hour). With that 
commit, the delays seem to max-out around 110 seconds - which to me 
still seems too long. HTX monitors each I/O's progress, and will start 
complaining if an I/O takes longer than 10 minutes to complete. With 
this commit, no I/Os seem to reach that 10 minute threshold and so HTX 
no longer complains.

In HTX, I am running the I/O-only variation (mdt.io) and in addition I 
disable all devices except for a few disks (otherwise unused). Early 
debugging showed that, in the case of the delayed I/Os, the delay is 
due to the I/O not getting submitted to the SCSI layer 
(scsi_dispatch_cmd). When the I/O eventually does get submitted to 
SCSI, it completes promptly.

The test phase was shown to be using only __blkdev_direct_IO_simple, 
so I put the debugging in that path. I replaced the call to 
io_schedule() with a debug routine that calls "io_schedule_timeout(HZ 
* 60)". If the timeout trips, then some debugging info is printed and 
io_schedule() is called to wait for eventual completion. Upon 
completion of a timed-out I/O, more debug is printed - including the 
total time elapsed. If no timeout occurs, no debugging is done.

OBSERVATIONS:

The period where I observe delayed I/Os seems to span about 2 hours. 
During that time, between 400-500 I/Os trip the 60-second timeout (4 
disks under test). It is estimated that about 2 million I/Os (per 
disk) occur during this interval. Earlier testing showed that a 
timeout of 30 seconds traps many more I/Os, so it is likely that the 
delays experienced by I/Os varies over the full range.

Both reads and writes can experience these delays, although most of 
the I/Os going on are reads.

What I am observing is that when HTX first starts (when it ramps-up 
enough) I/Os will start tripping the timeout. In this first phase, the 
I/Os seem to all complete in less than 70 seconds. The next phase is 
not reached for another 10 hours, but at that time the maximum I/O 
delay is over 90 seconds. Each successive phase shows maximum delay 
values that, in general, increase. The 6th cycle I observed over 110 
second delays, after which that maximum value declined. The 10th cycle 
(last one collected), showed a maximum of just under 100 seconds.

Test runs with the scheduler set to "deadline" never tripped the timeout.

QUESTIONS:

Is this behavior expected/correct? Are there other commits that might 
address this? Is there more information I can provide to help diagnose?

My concern is that certain processes/threads are getting unfairly 
delayed due to this behavior.

Thanks,

Doug

Some additional information. Without any patches, I see delays exceeding 
the 10-minute HTX threshold. With only this patch:

5be6b75610cefd1e21b98a218211922c2feb6e08 "cfq-iosched: fix the delay of 
cfq_group's vdisktime under iops mode"

I see some improvement of the I/Os delays (all are below 2 minutes), 
although I think the delays are still too long. But by adding these two 
patches:

4d608baac5f4e72b033a122b2d6d9499532c3afc "block: Initialize 
cfqq->ioprio_class in cfq_get_queue()"
142bbdfccc8b3e9f7342f2ce8422e76a3b45beae "cfq: Disable writeback 
throttling by default"

I regress back to the  long delays exceeding 10 minutes. I am guessing 
it is the writeback throttling patch which is causing the regression, 
but I don't understand enough of this code to see why.

Does anyone have information on CFQ scheduler and how to address these 
delays?