On Fri, Sep 9, 2011 at 10:00 AM, Takuya Yoshikawa
<yoshikawa.takuya@xxxxxxxxxxxxx> wrote:
> Vivek Goyal <vgoyal@xxxxxxxxxx> wrote:
>
>> So you are using RHEL 6.0 in both the host and guest kernel? Can you
>> reproduce the same issue with upstream kernels? How easily/frequently
>> can you reproduce this with the RHEL6.0 host?
>
> Guests were CentOS6.0.
>
> I have only RHEL6.0 and RHEL6.1 test results now.
> I want to try similar tests with upstream kernels if I can get some time.
>
> With the RHEL6.0 kernel, I heard that this issue was reproduced every time, 100%.
>
>> > On the host, we were running 3 Linux guests to see if I/O from these guests
>> > would be handled fairly by the host; each guest did a dd write with oflag=direct.
>> >
>> > Guest virtual disk:
>> > We used a host local disk which had 3 partitions, and each guest was
>> > allocated one of these as its dd write target.
>> >
>> > So our test was for checking whether cfq could keep fairness for the 3 guests
>> > who shared the same disk.
>> >
>> > The result (strange starvation):
>> > Sometimes, one guest dominated cfq for more than 10sec and requests from
>> > the other guests were not handled at all during that time.
>> >
>> > Below is the blktrace log which shows that a request to (8,27) in cfq2068S (*1)
>> > is not handled at all while cfq2095S and cfq2067S, which hold requests to
>> > (8,26), are being handled alternately.
>> >
>> > *1) WS 104920578 + 64
>> >
>> > Question:
>> > I guess that cfq_close_cooperator() was being called in an unusual manner.
>> > If so, do you think that cfq is responsible for keeping fairness for this
>> > kind of unusual write request?
>>
>> - If two guests are doing IO to separate partitions, they should really
>>   not be very close (unless the partitions are really small).
>
> Sorry for my lack of explanation.
>
> The IO was issued from QEMU and the cooperating threads were both for the same
> guest. In other words, QEMU was using two threads for one IO stream from the guest.
>
> As my blktrace log snippet showed, cfq2095S and cfq2067S handled one sequential
> IO stream between them; cfq2095S did 64KB, then cfq2067S did the next 64KB, and so on.
>
> These should be from the same guest because the target partition was the same
> one allocated to that guest.
>
> During those 10 seconds, this repetition continued without allowing others to interrupt.
>
> I know it is unnatural, but sometimes QEMU uses two aio threads for issuing one
> IO stream.
>
>>
>> - Even if there are close cooperators, these queues are merged and they
>>   are treated as a single queue from the slice point of view. So cooperating
>>   queues should be merged and get a single slice instead of starving
>>   other queues in the system.
>
> I understand that close cooperators' queues should be merged, but in our test
> case, when the 64KB request was issued from one aio thread, the other thread's
> queue was empty; because these queues are for the same stream, the next request
> could not come until the current request had finished.
>
> But this is complicated because it depends on the QEMU block layer aio code.
>
> I am not sure whether cfq would try to merge the queues in such cases.

Looking at posix-aio-compat.c, QEMU's thread pool for asynchronous I/O, this
seems like a fairly generic issue. Other applications may suffer from this
same I/O scheduler behavior. It would be nice to create a test case program
which doesn't use QEMU at all.

QEMU has a queue of requests that need to be processed. There is a pool of
threads that sleep until requests become available with
pthread_cond_timedwait(3). When a request is added to the queue,
pthread_cond_signal(3) is called in order to wake one sleeping thread.

This bouncing pattern between two threads that you describe is probably a
result of pthread_cond_timedwait(3) waking up each thread in alternating
fashion. So we get this pattern:

  A     B       <-- threads
  1             <-- I/O requests
        2
  3
        4
  5
        6
  ...
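As a starting point for a test case that doesn't involve QEMU, something like
the rough sketch below might do (untested; the constants and variable names
are made up for illustration, and it only mimics the posix-aio-compat.c
pattern, it is not QEMU code). Two worker threads sleep in
pthread_cond_timedwait(3), and the main thread queues one 64KB O_DIRECT write
at a time, waking one worker with pthread_cond_signal(3), so whichever worker
wakes up issues the next chunk of the stream:

/* Sketch of a QEMU-free reproducer: two worker threads pull sequential
 * O_DIRECT writes off a shared one-slot queue, roughly the way
 * posix-aio-compat.c hands requests to its thread pool.
 * Build with: gcc -O2 -pthread repro.c; run against a scratch partition.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define NUM_WORKERS 2
#define BLOCK_SIZE  (64 * 1024)   /* 64KB, like the dd in the report */
#define NUM_BLOCKS  4096

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;       /* "work available" */
static pthread_cond_t  done_cond = PTHREAD_COND_INITIALIZER;  /* "request finished" */
static off_t pending_offset = -1;  /* -1 means the queue is empty */
static int in_flight;
static int done;
static int fd;

static void *worker(void *arg)
{
    char *buf;

    /* O_DIRECT needs an aligned buffer */
    if (posix_memalign((void **)&buf, 4096, BLOCK_SIZE))
        return NULL;
    memset(buf, 0xab, BLOCK_SIZE);

    for (;;) {
        struct timespec ts;
        off_t offset;

        pthread_mutex_lock(&lock);
        while (pending_offset == -1 && !done) {
            clock_gettime(CLOCK_REALTIME, &ts);
            ts.tv_sec += 10;
            /* idle workers sleep here, as in posix-aio-compat.c */
            pthread_cond_timedwait(&cond, &lock, &ts);
        }
        if (done && pending_offset == -1) {
            pthread_mutex_unlock(&lock);
            break;
        }
        offset = pending_offset;
        pending_offset = -1;
        pthread_mutex_unlock(&lock);

        if (pwrite(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE)
            perror("pwrite");

        pthread_mutex_lock(&lock);
        in_flight = 0;
        pthread_cond_signal(&done_cond);  /* tell the submitter we finished */
        pthread_mutex_unlock(&lock);
    }
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tids[NUM_WORKERS];
    int i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_create(&tids[i], NULL, worker, NULL);

    /* Submit one 64KB write at a time; the next one is only queued after
     * the previous completes, so whichever worker wakes up first gets it. */
    for (i = 0; i < NUM_BLOCKS; i++) {
        pthread_mutex_lock(&lock);
        pending_offset = (off_t)i * BLOCK_SIZE;
        in_flight = 1;
        pthread_cond_signal(&cond);  /* wake one sleeping worker */
        while (in_flight)
            pthread_cond_wait(&done_cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    for (i = 0; i < NUM_WORKERS; i++)
        pthread_join(tids[i], NULL);
    close(fd);
    return 0;
}

Run against a spare partition while dd writers hit the neighbouring
partitions, this should make it possible to check whether cfq shows the same
starvation without QEMU in the picture.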
Stefan