Re: qemu takes 100% of a core, freezes the VM

Stefan Hajnoczi <stefanha@xxxxxxxxx> · Thu, 8 Feb 2018 09:24:59 +0000

On Fri, Feb 02, 2018 at 10:52:34AM -0500, JimR wrote:
> On 02/01/2018 11:52 AM, Stefan Hajnoczi wrote:
> > On Wed, Jan 31, 2018 at 10:56:47AM -0500, JimR wrote:
> > > Host:  Fedora 26 with all patches on HP Pavilion 4-core 3.2 GHz
> > > 
> > > VMM 1.4.3
> > > 
> > > Guest: RHEL 7.4, server with GUI. (also CentOS 7 server with GUI, but never
> > > running at the same time as rhel)
> > > 
> > > Guest invariably freezes, sometimes after 5 minutes, sometimes after 45
> > > minutes.  It will not accept any keyboard nor mouse input.  This happens
> > > when the only application running in guest is the terminal, but it is not
> > > running anything, just waiting for my next command.
> > > 
> > > VMM shows CPU usage spikes and stays there.  Host htop shows qemu is taking
> > > 100% of one core.
> > Please post the output of "mpstat -P ALL 1".  mpstat is from the sysstat
> > package.
> > 
> > If you see 100% %usr then QEMU is spinning.
> > 
> > If you see 100% %guest then the guest is spinning.
> > 
> > The next step would be to drill down on what activity is taking 100%
> > CPU.
> > 
> > Have you installed the latest updates on the host and inside the guest?
> > 
> > Stefan
> 
> I'm not sure if you received my reply yesterday.  It had a screen shot of
> htop embedded.  That seemed to bounce from majordomo.

The mpstat output you posted had 100% %guest and low %usr utilization.
This suggests the hang is not within the QEMU process on the host.  It's
the guest that is consuming a lot of CPU.
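
To make the %usr versus %guest reading repeatable, here is a minimal Python
sketch that parses "mpstat -P ALL 1" text and labels each CPU.  The function
names, the 90% threshold, and the sample numbers are all assumptions made up
for illustration; they are not the output posted in this thread.  Note that
high %usr only points at QEMU if QEMU is the busy process on that core.

```python
# Sketch only: names, threshold and SAMPLE values are illustrative assumptions.

SAMPLE = """\
09:40:01 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
09:40:02 AM  all    0.50    0.00    0.25    0.00    0.00    0.00    0.00   25.00    0.00   74.25
09:40:02 AM    0    1.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   98.50
09:40:02 AM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00
"""


def parse_mpstat(text):
    """Return {cpu: {column: value}} holding the last sample seen per CPU."""
    columns = None
    samples = {}
    for line in text.splitlines():
        tokens = line.split()
        # The header row names the columns; remember them from "CPU" onwards.
        if "CPU" in tokens and "%usr" in tokens:
            columns = tokens[tokens.index("CPU"):]
            continue
        if columns is None or len(tokens) < len(columns):
            continue
        # Align from the right so a 12h or 24h timestamp prefix does not matter.
        fields = tokens[-len(columns):]
        try:
            values = [float(v) for v in fields[1:]]
        except ValueError:
            continue
        samples[fields[0]] = dict(zip(columns[1:], values))
    return samples


def classify(stats, threshold=90.0):
    """Apply the %guest / %usr rule of thumb to one CPU's sample."""
    if stats.get("%guest", 0.0) >= threshold:
        return "guest spinning"
    if stats.get("%usr", 0.0) >= threshold:
        return "host userspace spinning (QEMU, if it owns this core)"
    return "not saturated"


for cpu, stats in parse_mpstat(SAMPLE).items():
    print(cpu, classify(stats))
```

On a real host the text would come from captured mpstat output rather than
the SAMPLE string.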

> Here is some additional info from a freeze this morning.  This is from the
> guest's /var/log/messages.  Note that the Call Trace repeated 7 times just
> before the freeze.
> 
> Feb  2 09:39:48 rhcsa kernel: INFO: task systemd:1 blocked for more than 120 seconds.
> Feb  2 09:39:48 rhcsa kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb  2 09:39:48 rhcsa kernel: systemd         D ffff88003dba0000 0     1      0 0x00000000
> Feb  2 09:39:48 rhcsa kernel: Call Trace:
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff816ab8a9>] schedule+0x29/0x70
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff810625bf>] kvm_async_pf_task_wait+0x1df/0x230
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff810b3690>] ? wake_up_atomic_t+0x30/0x30
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff816afc00>] ? error_swapgs+0x61/0x18d
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff816afcef>] ? error_swapgs+0x150/0x18d
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff816b32d6>] do_async_page_fault+0x96/0xd0
> Feb  2 09:39:48 rhcsa kernel: [<ffffffff816af928>] async_page_fault+0x28/0x30
> Feb  2 09:39:48 rhcsa kernel: INFO: task crond:999 blocked for more than 120 seconds.
> Feb  2 09:39:48 rhcsa kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb  2 09:39:48 rhcsa kernel: crond           D ffff880036421fa0 0   999      1 0x00000080

Weird, it looks like the guest took an async page fault
(kvm_async_pf_task_wait in the trace) and hung when trying to schedule
another task while the hypervisor resolves the page fault.

I hope someone else has ideas on what to check next.

My next idea is low-level debugging and might be too time-consuming for
you:

I would use "perf record -a -e kvm:\*" on the host while the guest is hung
and then "perf script" to view the trace log.  It contains all
vmenter/vmexit activity and might contain a clue about what the guest is
trying to do.
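
A trace like that can be long, so one way to skim it is to count the vmexit
reasons.  The Python sketch below does this for text in the general shape of
"perf script" output.  The SAMPLE lines are made up for illustration and are
not a real trace from this system; the field layout of kvm:kvm_exit lines
varies across kernel and perf versions, so the regular expression is an
assumption that may need adjusting.

```python
import re
from collections import Counter

# Assumed line shape: "... kvm:kvm_exit: [vcpu N] reason NAME rip 0x... info ..."
EXIT_RE = re.compile(r"kvm:kvm_exit:.*?\breason\s+(\S+)")

# Made-up example lines, not captured from any real host.
SAMPLE = """\
 qemu-kvm  4242 [002]  1000.000001: kvm:kvm_exit: reason EPT_VIOLATION rip 0xffffffff81060ff6 info 181 0
 qemu-kvm  4242 [002]  1000.000005: kvm:kvm_entry: vcpu 0
 qemu-kvm  4242 [002]  1000.000009: kvm:kvm_exit: reason EPT_VIOLATION rip 0xffffffff81060ff6 info 181 0
 qemu-kvm  4242 [002]  1000.000020: kvm:kvm_exit: reason HLT rip 0xffffffff81060ff6 info 0 0
"""


def count_exit_reasons(lines):
    """Count how often each vmexit reason appears in the given trace lines."""
    counts = Counter()
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Print the reasons from most to least frequent.
for reason, n in count_exit_reasons(SAMPLE.splitlines()).most_common():
    print(n, reason)
```

A single exit reason dominating the counts while the guest is hung would be
the first thing to look into.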

The "perf kvm" command might be useful in showing what's going on inside
the guest.  It profiles CPU activity inside the guest kernel.

Stefan