Re: VM clock stopped after host suspend

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Thu, 23 Mar 2017 15:54:04 -0300

On Sun, Mar 19, 2017 at 03:42:16PM +0100, Marc Haber wrote:
> Hi,
> 
> I am running a bunch of test VMs on a host (with an AMD Phenom II X6
> 1090T Processor [I am afraid this matters]) using KVM. Host and guest
> OS is Debian unstable, and I'm running home-brewed kernel trying to
> stay close to Greg's stable releases. Disks are encrypted, so
> rebooting the machine from remote is a bit of a pain.
> 
> Since those are just test VMs and the host is also my home desktop
> machine, I suspend the host at night without caring for the VMs.
> Usually, this works fine with the VMs just chugging away again after
> waking up the host.
> 
> However, sometimes it happens that the clock in the VMs stays stopped
> after waking up the host. That means, date, wait 10 seconds, date,
> will yield the same output (the last datestamp of when the host was
> suspended), and a sleep call will never return to the shell.

Ok, so timekeeping in the guest is not functioning: either because
the services provided by the host necessary for timekeeping are 
not functional (such as timer interrupts), or because of a bug in the
guest timekeeping code.

> In this case, the VMs run just normally until they encounter a sleep
> call. In this case, the affected process will just sit still and wait
> for the sleep to return which never happens. If the job is still in
> foreground of shell session, aborting with ctrl-C works.
> 
> Of course this is not a desirable state of operation. The system is
> usually a candidate for the MagicSysRq BUSIER routine since a normal
> shutdown contains sleep calls...
> 
> I tried reproducing this on a test box that is easier to reboot to be
> able to bisect, but I was not able to reproduce the issue there. The
> test box has a Sandy Bridge i5 processor, which is the reason that I
> suspect that the CPU type matters. Sadly, I do not have a second
> Phenom available.
> 
> Has anybody ever encountered this situation? Any ideas how to debug
> this?

I've never seen this before. To debug, I would:

	1) enable the following tracepoints in the host:
		# echo kvm_inj_virq > /sys/kernel/debug/tracing/set_event

	2) enable tracing for the following function in the guest (I
	   don't remember off the top of my head how to do it; search
	   for set_ftrace_filter in the ftrace documentation):

		update_wall_time
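
For the guest side, something along these lines should work. It is a
minimal sketch, not tested on this setup: it assumes debugfs is mounted
at /sys/kernel/debug in the guest and that the guest kernel was built
with the function tracer. The names trace_function and TRACING are made
up for illustration; the files written to are the standard ftrace
control files.

```shell
#!/bin/sh
# Sketch: restrict the ftrace function tracer to a single function.
# Assumption: debugfs is mounted at /sys/kernel/debug (override TRACING if not).
TRACING=${TRACING:-/sys/kernel/debug/tracing}

# trace_function is a hypothetical helper name, not part of ftrace.
trace_function() {
    func=$1
    # Bail out early if this kernel does not list the symbol as traceable.
    grep -qw -- "$func" "$TRACING/available_filter_functions" || {
        echo "$func is not traceable on this kernel" >&2
        return 1
    }
    echo "$func" > "$TRACING/set_ftrace_filter"   # trace only this function
    echo function > "$TRACING/current_tracer"     # select the function tracer
    echo 1 > "$TRACING/tracing_on"                # make sure tracing is running
}
```

Run as root in the guest: call "trace_function update_wall_time", then
read $TRACING/trace_pipe and watch for update_wall_time entries.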

This should let you know whether the host or the guest is at
fault.

What kernel versions are the host and guest running, again?