Re: kvm + raid1 showstopper bug

Jes Sorensen <Jes.Sorensen@xxxxxxxxxx> · Tue, 21 Feb 2012 13:40:28 +0100

On 02/18/12 14:25, Stefan Hajnoczi wrote:
> On Fri, Feb 17, 2012 at 3:31 PM, Pete Ashdown <pashdown@xxxxxxxxxxxx> wrote:
>> > On 02/17/2012 04:30 AM, Stefan Hajnoczi wrote:
>>> >> On Fri, Feb 17, 2012 at 4:57 AM, Pete Ashdown <pashdown@xxxxxxxxxxxx> wrote:
>>>> >>> I've been waiting for some response from the Ubuntu team regarding a bug on
>>>> >>> launchpad, but it appears that it isn't being taken seriously:
>>>> >>>
>>>> >>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/745785
>>> >> This looks interesting.  Let me try to summarize, please point out if
>>> >> I get something wrong:
>>> >>
>>> >> You have software RAID1 on the host, your disk images live on this
>>> >> device.  Whenever checkarray runs on the host you find that VMs become
>>> >> unresponsive.  Guests print warnings that a task is blocked for more
>>> >> than 120 seconds.  Guests become unresponsive on the network.
>> > In my case, it is drbd+RAID10, but the bug still applies.  It isn't
>> > whenever checkarray runs, but whenever checkarray decides to do a resync,
>> > it will block all IO somewhere before the end of the resync.  Then yes, it
>> > isn't long before the guests start to fail due to their inability to
>> > read/write.
> I have not attempted to reproduce this yet but have taken a look at
> drviers/md/raid10.c resync code.  md resync uses a similar mechanism
> for RAID1 and RAID10.  While a block is being synced the entire device
> will force regular I/O requests to wait.  There are tunables which let
> you rate-limit resyncing, I think this can solve your problem.
> Perhaps the resync is too aggressive and is impacting regular I/O so
> much that the guest is warning about it.  See Documentation/md.txt for
> sync_speed_max and other sysfs attributes.
> 
> The bug report suggests qemu-kvm itself is operating fine because the
> guest is still executing and VNC/monitor are alive.  After a while the
> guest warns about the stuck I/O.

It could be a bug in the raid1/raid10 code, which is triggered by the
way qemu keeps file on it (O_DIRECT or something). However there have
been a *lot* of fixes to the raid code since 2.6.38 (which is the kernel
version I saw referenced in the launchpad link). Please try and
reproduce it with a more uptodate kernel.

I have no idea what drbd8 is and what relation it has to the kernel or
kvm here.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html