On Thu, Oct 08, 2009 at 01:17:35AM +0200, Frederic Weisbecker wrote: > (Adding Thomas in Cc) > > > On Sat, Oct 03, 2009 at 08:12:05PM -0300, Marcelo Tosatti wrote: > > Michael, > > > > Can you please give the patch below a try please? (without acpi_pm timer > > or priority adjustments for the guest). > > > > On Tue, Sep 29, 2009 at 05:12:17PM +0400, Michael Tokarev wrote: > > > Hello. > > > > > > I'm having quite an.. unusable system here. > > > It's not really a regresssion with 0.11.0, > > > it was something similar before, but with > > > 0.11.0 and/or 2.6.31 it become much worse. > > > > > > The thing is that after some uptime, kvm > > > guest prints something like this: > > > > > > hrtimer: interrupt too slow, forcing clock min delta to 461487495 ns > > > > > > after which system (guest) speeed becomes > > > very slow. The above message is from > > > 2.6.31 guest running wiht 0.11.0 & 2.6.31 > > > host. Before I tried it with 0.10.6 and > > > 2.6.30 or 2.6.27, and the delta were a > > > bit less than that: > > > > > > hrtimer: interrupt too slow, forcing clock min delta to 152222415 ns > > > hrtimer: interrupt too slow, forcing clock min delta to 93629025 ns > > > > It seems the way hrtimer_interrupt_hanging calculates min_delta is > > wrong (especially to virtual machines). The guest vcpu can be scheduled > > out during the execution of the hrtimer callbacks (and the callbacks > > themselves can do operations that translate to blocking operations in > > the hypervisor). > > > > So high min_delta values can be calculated if, for example, a single > > hrtimer_interrupt run takes two host time slices to execute, while some > > other higher priority task runs for N slices in between. > > > > Using the hrtimer_interrupt execution time (which can be the worse > > case at any given time), as the min_delta is problematic. > > > > So simply increase min_delta_ns by 50% once every detected failure, > > which will eventually lead to an acceptable threshold (the algorithm > > should scale back to down lower min_delta, to adjust back to wealthier > > times, too). > > > > diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c > > index 49da79a..8997978 100644 > > --- a/kernel/hrtimer.c > > +++ b/kernel/hrtimer.c > > @@ -1234,28 +1234,20 @@ static void __run_hrtimer(struct hrtimer *timer) > > > > #ifdef CONFIG_HIGH_RES_TIMERS > > > > -static int force_clock_reprogram; > > - > > /* > > * After 5 iteration's attempts, we consider that hrtimer_interrupt() > > * is hanging, which could happen with something that slows the interrupt > > - * such as the tracing. Then we force the clock reprogramming for each future > > - * hrtimer interrupts to avoid infinite loops and use the min_delta_ns > > - * threshold that we will overwrite. > > - * The next tick event will be scheduled to 3 times we currently spend on > > - * hrtimer_interrupt(). This gives a good compromise, the cpus will spend > > - * 1/4 of their time to process the hrtimer interrupts. This is enough to > > - * let it running without serious starvation. > > + * such as the tracing, so we increase min_delta_ns. > > */ > > > > static inline void > > -hrtimer_interrupt_hanging(struct clock_event_device *dev, > > - ktime_t try_time) > > +hrtimer_interrupt_hanging(struct clock_event_device *dev) > > { > > - force_clock_reprogram = 1; > > - dev->min_delta_ns = (unsigned long)try_time.tv64 * 3; > > - printk(KERN_WARNING "hrtimer: interrupt too slow, " > > - "forcing clock min delta to %lu ns\n", dev->min_delta_ns); > > + dev->min_delta_ns += dev->min_delta_ns >> 1; > > > I haven't thought about the guest that could be scheduled out in > the middle of the timers servicing, making wrong this check based > of the time spent in hrtimer_interrupt(). > > I guess there is no easy/generic/cheap way to rebase this check > on the _virtual_ time spent in the timers servicing. By virtual, > I mean the time spent in the guest only. > > In a non-guest kernel, the old check forces an adaptive rate > sharing: > > - we spent n nanosecs to service the batch of timers. > - we are hanging > - we want at least 3/4 of time reserved for non-timer > servicing in the kernel, this is a minimum prerequisite > for the system to not starve > - adapt the min_clock_delta against to fit the above constraint > > All that does not make sense anymore in a guest. The hang detection > and warnings, the recalibrations of the min_clock_deltas are completely > wrong in this context. > Not only does it spuriously warn, but the minimum timer is increasing > slowly and the guest progressively suffers from higher and higher > latencies. > > That's really bad. > > Your patch lowers the immediate impact and makes this illness evolving > smoother by scaling down the recalibration to the min_clock_delta. > This appeases the bug but doesn't solve it. I fear it could be even > worse because it makes it more discreet. True. > May be can we instead increase the minimum threshold of loop in the > hrtimer interrupt before considering it as a hang? Hmm, but a too high > number could make this check useless, depending of the number of pending > timers, which is a finite number. > > Well, actually I'm not confident anymore in this check. Or actually we > should change it. May be we can rebase it on the time spent on the hrtimer > interrupt (and check it every 10 loops of reprocessing in hrtimer_interrupts). > > Would a mimimum threshold of 5 seconds spent in hrtimer_interrupt() be > a reasonable check to perform? > We should probably base our check on such kind of high boundary. > What we want is an ultimate rescue against hard hangs anyway, not > something that can solve the hang source itself. After the min_clock_delta > recalibration, the system will be unstable (eg: high latencies). > So if this must behave as a hammer, let's ensure we really need this hammer, > even if we need to wait for few seconds before it triggers. > > > What do you think? What about getting rid of the retry loop, instead? So something like: - run hrtimer callbacks (once) - while (tick_program_event(expires)) expires = ktime_add_ns(expires, dev->min_delta_ns) This way there's no static tuning involved. Its not clear to me why the loop is there in the first place. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html