Re: qemu-1.4.0 and onwards, linux kernel 3.2.x, ceph-RBD, heavy I/O leads to kernel_hung_tasks_timout_secs message and unresponsive qemu-process, [Qemu-devel] [Bug 1207686]

Hi Oliver,

(Posted this on the bug too, but:)

Your last log revealed a bug in the librados aio flush.  A fix is pushed 
to wip-librados-aio-flush (bobtail) and wip-5919 (master); can you retest 
please (with caching off again)?
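
In case it helps with the retest: caching can be turned off per-client with 
rbd_cache=false, and the path in question is just the usual librbd aio 
write + flush + completion-callback sequence that qemu drives.  A rough, 
untested sketch (error checking omitted; pool "rbd" and image "test-img" 
are only placeholders):

    #include <rados/librados.h>
    #include <rbd/librbd.h>
    #include <stdio.h>
    #include <string.h>

    /* qemu's rbd driver completes the guest request from a callback like
     * this; if it never fires, the virtio-blk request stays pending. */
    static void on_complete(rbd_completion_t c, void *arg)
    {
        (void)arg;
        fprintf(stderr, "aio completion fired, ret=%d\n",
                (int)rbd_aio_get_return_value(c));
    }

    int main(void)
    {
        rados_t cluster;
        rados_ioctx_t io;
        rbd_image_t img;
        rbd_completion_t wc, fc;
        char buf[4096];

        memset(buf, 0xab, sizeof(buf));

        rados_create(&cluster, NULL);                  /* client.admin */
        rados_conf_read_file(cluster, NULL);           /* /etc/ceph/ceph.conf */
        rados_conf_set(cluster, "rbd_cache", "false"); /* caching off */
        rados_connect(cluster);
        rados_ioctx_create(cluster, "rbd", &io);
        rbd_open(io, "test-img", &img, NULL);

        rbd_aio_create_completion(NULL, on_complete, &wc);
        rbd_aio_write(img, 0, sizeof(buf), buf, wc);

        rbd_aio_create_completion(NULL, on_complete, &fc);
        rbd_aio_flush(img, fc);          /* the path the fix touches */

        rbd_aio_wait_for_complete(wc);
        rbd_aio_wait_for_complete(fc);   /* hangs if the flush callback is lost */
        rbd_aio_release(wc);
        rbd_aio_release(fc);

        rbd_close(img);
        rados_ioctx_destroy(io);
        rados_shutdown(cluster);
        return 0;
    }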

Thanks!
sage


On Fri, 9 Aug 2013, Oliver Francke wrote:
> Hi Josh,
> 
> just opened
> 
> http://tracker.ceph.com/issues/5919
> 
> with all collected information incl. debug-log.
> 
> Hope it helps,
> 
> Oliver.
> 
> On 08/08/2013 07:01 PM, Josh Durgin wrote:
> > On 08/08/2013 05:40 AM, Oliver Francke wrote:
> > > Hi Josh,
> > > 
> > > I have a session logged with:
> > > 
> > >      debug_ms=1:debug_rbd=20:debug_objectcacher=30
> > > 
> > > as you requested from Mike, even though I think we have another story
> > > here anyway.
> > > 
> > > Host-kernel is: 3.10.0-rc7, qemu-client 1.6.0-rc2, client-kernel is
> > > 3.2.0-51-amd...
> > > 
> > > Do you want me to open a ticket for that stuff? I have about 5MB
> > > compressed logfile waiting for you ;)
> > 
> > Yes, that'd be great. If you could include the time when you saw the guest
> > hang that'd be ideal. I'm not sure if this is one or two bugs,
> > but it seems likely it's a bug in rbd and not qemu.
> > 
> > Thanks!
> > Josh
> > 
> > > Thnx in advance,
> > > 
> > > Oliver.
> > > 
> > > On 08/05/2013 09:48 AM, Stefan Hajnoczi wrote:
> > > > On Sun, Aug 04, 2013 at 03:36:52PM +0200, Oliver Francke wrote:
> > > > > Am 02.08.2013 um 23:47 schrieb Mike Dawson <mike.dawson@xxxxxxxxxxxx>:
> > > > > > We can "un-wedge" the guest by opening a NoVNC session or running a
> > > > > > 'virsh screenshot' command. After that, the guest resumes and runs
> > > > > > as expected. At that point we can examine the guest. Each time we'll
> > > > > > see:
> > > > If virsh screenshot works then this confirms that QEMU itself is still
> > > > responding.  Its main loop cannot be blocked since it was able to
> > > > process the screendump command.
> > > > 
> > > > This supports Josh's theory that a callback is not being invoked.  The
> > > > virtio-blk I/O request would be left in a pending state.
> > > > 
> > > > Now here is where the behavior varies between configurations:
> > > > 
> > > > On a Windows guest with 1 vCPU, you may see the symptom that the guest
> > > > no longer responds to ping.
> > > > 
> > > > On a Linux guest with multiple vCPUs, you may see the hung task message
> > > > from the guest kernel because other vCPUs are still making progress.
> > > > Just the vCPU that issued the I/O request and whose task is in
> > > > UNINTERRUPTIBLE state would really be stuck.
> > > > 
> > > > Basically, the symptoms depend not just on how QEMU is behaving but also
> > > > on the guest kernel and how many vCPUs you have configured.
> > > > 
> > > > I think this can explain how both problems you are observing, Oliver and
> > > > Mike, are a result of the same bug.  At least I hope they are :).
> > > > 
> > > > Stefan
> > > 
> > > 
> > 
> 
> 
> -- 
> 
> Oliver Francke
> 
> filoo GmbH
> Moltkestraße 25a
> 33330 Gütersloh
> HRB4355 AG Gütersloh
> 
> Geschäftsführer: J.Rehpöhler | C.Kunz
> 
> Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
> 
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


