Re: watchdog: print stolen time increment at softlockup detection

Marcelo Tosatti <mtosatti@xxxxxxxxxx> · Wed, 3 Jul 2013 23:15:26 -0300

On Wed, Jul 03, 2013 at 12:44:01PM -0400, Don Zickus wrote:
> On Fri, Jun 28, 2013 at 05:37:39PM -0300, Marcelo Tosatti wrote:
> > On Fri, Jun 28, 2013 at 10:12:15AM -0400, Don Zickus wrote:
> > > On Thu, Jun 27, 2013 at 11:57:23PM -0300, Marcelo Tosatti wrote:
> > > > 
> > > > One possibility for a softlockup report in a Linux VM, is that the host
> > > > system is overcommitted to the point where the watchdog task is unable
> > > > to make progress (unable to touch the watchdog).
> > > 
> > > I think I am confused on the VM/host stuff.  How does an overcommitted
> > > host prevent a high priority task like the watchdog from running?
> > > 
> > > Or is it the watchdog task on the VM that is being blocked from running
> > > because the host is overcommitted and can't run the VM frequently enough?
> > 
> > Yes, that's the case.
> > 
> > > The latter would make sense, though I thought you solved that with the
> > > other kvm splat in the watchdog code a while ago.  So I would be
> > > interested in understanding why the previous solution isn't working.
> > 
> > That functionality is for a notification so the guest ignores the time
> > jump induced by a vm pause. This problem is similar to the kgdb case.
> > 
> > > Second, I am still curious how this problem differs from say kgdb or
> > > suspend-hibernate/resume.  Doesn't both of those scenarios deal with a
> > > clock that suddenly jumps forward without the watchdog task running?
> > 
> > The difference is this:
> > 
> > The present functionality in watchdog.c allows the hypervisor to notify
> > the guest that it should ignore the large delta seen via clock reads
> > (at the watchdog timer interrupt).
> > This notification is used for the case where the vm has been paused for
> > a period of time.
> 
> But why do this at the watchdog timer interrupt?  I thought this would be
> done at the lower layer like in sched_clock() or something.
> 
> > 
> > Are you suggesting the host should silence the guest watchdog, also in
> > the overcommitment case? Issues i see with that:
> > 
> > 1) The host is not aware of the variable softlockup threshold in
> > the guest.
> > 
> > 2) Whatever the threshold of overcommitment for sending the ignore
> > softlockup notification to the guest, genuine softlockup detections in
> > the guest could be silenced, given proper conditioning.
> 
> No.  That would be difficult as you described.  What I am trying to get at
> is, doesn't the guest /know/ time jumped when it schedules again?  And
> can't it determine based on this jump that something unreasonable
> happened like a long pause or an overcommit?

A large jump alone is not enough information to reset the watchdog(s).

For example, consider this large jump scenario:

1. guest instruction exits to host for emulation.
2. emulation completes after 10 minutes, resumes execution at 
next instruction.
3. watchdog detects jump and prints a warning.

If the jump is due to inefficiency or incorrect emulation, the message
should be printed.
If the jump is due to a vm pause, the message should not be printed.
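
To make the distinction concrete, here is a minimal userspace sketch in
C (not kernel code). The names jump_cause, classify_jump() and
should_print_softlockup() are made up for illustration. The point is
only that once the delta exceeds the threshold, its size does not decide
anything; the cause does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical classification of a clock jump; not a kernel type. */
enum jump_cause {
    JUMP_NONE,        /* delta is within the softlockup threshold */
    JUMP_VM_PAUSE,    /* host notified the guest that it was paused */
    JUMP_UNEXPLAINED, /* slow or incorrect emulation, genuine lockup, ... */
};

/* Classify a jump using the delta AND the pause notification. */
static enum jump_cause classify_jump(uint64_t delta_ns, uint64_t threshold_ns,
                                     bool pause_notified)
{
    assert(threshold_ns > 0);

    if (delta_ns <= threshold_ns)
        return JUMP_NONE;
    return pause_notified ? JUMP_VM_PAUSE : JUMP_UNEXPLAINED;
}

/* Only an unexplained jump should produce the softlockup message. */
static bool should_print_softlockup(uint64_t delta_ns, uint64_t threshold_ns,
                                    bool pause_notified)
{
    return classify_jump(delta_ns, threshold_ns, pause_notified)
           == JUMP_UNEXPLAINED;
}
```

With a 10 minute delta, the message is printed when no pause
notification was seen and suppressed when one was, which is the
behaviour described above.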

> > And why overcommitment is not a valid reason to generate a softlockup in
> > the first place ?
> 
> For the guest I don't believe it is.  It isn't the guest's fault it
> couldn't run processes.  A warning should be scheduled on the host that it
> couldn't run a process in a very long time.
>
> > > For some reason I had the impression that when a VM starts running again,
> > > one of the first things it does is sync up its clock again (which leads to
> > > a softlockup shortly thereafter in the case of paused/overcommitted VMs)?
> > 
> > Sort of, the kvmclock counts while the VM is running (whether it is
> > overcommitted or not).
> 
> Does comparing the kvmclock with the current clock indicate that a long
> pause or an overcommit occurred?

By current clock you mean system clock? sched_clock() reads from
kvmclock.

> > > At that time I would have thought that the code could detect a large jump
> > > in time and touch_softlockup_watchdog_sync() or something to delay the
> > > check until the next cycle.
> > 
> > But this would silence any softlockups that are due to delays
> > in the host preventing the watchdog task from making progress (eg:
> > https://lkml.org/lkml/2013/6/20/633, in that case if 1 operation took
> > longer than expected your suggestion would silence the report).
> 
> Ok.  I don't fully understand that problem, the changelog was a little
> vague.

That problem is described in the large jump scenario with guest
instruction exiting for emulation (in the beginning of this message).

> > > That would make the watchdog code a lot less messy than having custom
> > > kvm/paravirt splat all over it.  Generic solutions are always nice. :-)
> > 
> > Can you give more detail on what the suggestion is and how can you deal
> > with points 1 and 2 above?
> 
> I don't have a good suggestion, just a lot of questions really.  The thing
> is there are lots of watchdogs in the system (ie clock watchdog,
> filesystem watchdog, rcu stalls, etc).  Solving this problem just for the lockup
> watchdog doesn't seem right because if the lockup timeout was longer, you
> would probably hit the other watchdogs too.

Agreed. However, I can't see a way around "having custom
kvm/paravirt splat all over" for watchdogs that do:

1. check for watchdog resets
2. read time via sched_clock or xtime.
3. based on 2, decide whether there has been a longer delay than
acceptable.

This is the case for the softlockup timer interrupt. So the splat there
is necessary (otherwise a vm-pause notification noticed at step 2 might
be missed, because it is only checked for at step 1).
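
The ordering problem can be shown with a small userspace simulation in
C. This is a sketch under assumptions, not the watchdog.c code: struct
sim_guest, sim_read_clock() and sim_timer_check() are hypothetical
names, and the pause notification is modelled as a flag that is only
noticed when the clock is read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the state the timer check looks at. */
struct sim_guest {
    uint64_t now_ns;         /* current clock value */
    uint64_t last_touch_ns;  /* last time the watchdog was touched */
    bool reset_requested;    /* a watchdog reset is pending */
    bool host_pause_pending; /* host flagged a vm pause, not yet noticed */
};

/* Step 2: reading the clock is where the pause notification is noticed. */
static uint64_t sim_read_clock(struct sim_guest *g)
{
    if (g->host_pause_pending) {
        g->host_pause_pending = false;
        g->reset_requested = true;
    }
    return g->now_ns;
}

/* Returns true when a softlockup would be reported on this tick. */
static bool sim_timer_check(struct sim_guest *g, uint64_t threshold_ns,
                            bool recheck_after_read)
{
    uint64_t now;

    assert(g != NULL);

    /* Step 1: honour resets requested before this tick. */
    if (g->reset_requested) {
        g->reset_requested = false;
        g->last_touch_ns = g->now_ns;
        return false;
    }

    /* Step 2: read the time; this may raise reset_requested. */
    now = sim_read_clock(g);

    /* The extra check in the timer path: look at the flag again. */
    if (recheck_after_read && g->reset_requested) {
        g->reset_requested = false;
        g->last_touch_ns = now;
        return false;
    }

    /* Step 3: decide whether the delay is longer than acceptable. */
    return now - g->last_touch_ns > threshold_ns;
}
```

Without the re-check, a pause that is noticed during the clock read
still produces a report on that tick, and the reset is only consumed on
the following one. With the re-check, the same tick resets the
timestamp instead.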

For watchdogs that measure time based on an interrupt event (such as hung
task and rcu_cpu_stall), checking for the notification at sched_clock or
lower is fine.

> So my suggestion (based on my ignorance of how the clock code works) is
> that some sort of generic mechanism be applied to all the watchdogs.  Much
> like how kgdb touches all of them at once when it handles an exception.
> 
> For example, unpausing a guest could be a good time to touch all the
> watchdogs as you have no idea how long the pause was.  I can't think of
> any hook for an overcommit though.

It's a good suggestion - will write a patch to touch watchdogs at read
of kvmclock.
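
Roughly, the shape would be something like the following userspace
sketch in C. It is not the patch: watchdog_registry, registry_add(),
registry_touch_all() and sim_clock_read() are hypothetical names, and
the paused flag stands in for whatever notification the host provides.
The idea is that the clock read path touches every registered watchdog
once when it notices the pause.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WATCHDOGS 8

/* Hypothetical "touch" callback, one per watchdog. */
typedef void (*touch_fn)(void *ctx);

struct watchdog_registry {
    touch_fn fn[MAX_WATCHDOGS];
    void *ctx[MAX_WATCHDOGS];
    size_t count;
};

/* Register a watchdog; returns 0 on success, -1 if the table is full. */
static int registry_add(struct watchdog_registry *r, touch_fn fn, void *ctx)
{
    assert(fn != NULL);

    if (r->count >= MAX_WATCHDOGS)
        return -1;
    r->fn[r->count] = fn;
    r->ctx[r->count] = ctx;
    r->count++;
    return 0;
}

/* Touch every registered watchdog, in registration order. */
static void registry_touch_all(struct watchdog_registry *r)
{
    for (size_t i = 0; i < r->count; i++)
        r->fn[i](r->ctx[i]);
}

/* Simulated clock with a "guest was paused" flag set by the host. */
struct sim_clock {
    uint64_t now_ns;
    bool guest_paused;
};

/* Clock read: on a pending pause, clear it and touch all watchdogs. */
static uint64_t sim_clock_read(struct sim_clock *c, struct watchdog_registry *r)
{
    if (c->guest_paused) {
        c->guest_paused = false;
        registry_touch_all(r);
    }
    return c->now_ns;
}

/* Example callback for illustration: counts how often it was touched. */
static void count_touch(void *ctx)
{
    (*(int *)ctx)++;
}
```

This only covers the vm pause case; as noted above, there is no
equivalent hook for overcommit.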

Thanks!
