On Sat, Mar 13, 2021 at 8:58 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Mar 10, 2021, Haiwei Li wrote:
> > On Wed, Mar 10, 2021 at 7:42 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Mar 03, 2021, Haiwei Li wrote:
> > > > On 21/3/3 10:09, lihaiwei.kernel@xxxxxxxxx wrote:
> > > > > From: Haiwei Li <lihaiwei@xxxxxxxxxxx>
> > > > >
> > > > > In my test environment, advance_expire_delta is frequently greater than
> > > > > the fixed LAPIC_TIMER_ADVANCE_ADJUST_MAX. And this will hinder the
> > > > > adjustment.
> > > >
> > > > Supplementary details:
> > > >
> > > > I have tried to backport timer related features to our production
> > > > kernel.
> > > >
> > > > After that was completed, I found that advance_expire_delta is
> > > > frequently greater than the fixed value. It's necessary to turn the
> > > > fixed value into a dynamic one.
> > >
> > > Does this reproduce on an upstream kernel?  If so...
> > >
> > >   1. How much over the 10k cycle limit is the delta?
> > >   2. Any idea what causes the large delta?  E.g. is there something that can
> > >      and/or should be fixed elsewhere?
> > >   3. Is it platform/CPU specific?
> >
> > Hi, Sean
> >
> > I have traced the flow on our production kernel and it frequently consumes more
> > than 10K cycles from sched_out to sched_in.
> > So I tested two scenarios on a Cascade Lake server (96 pCPUs) with a v5.11
> > kernel.
> >
> > 1. Only cyclictest in the guest (88 vCPUs, bound to isolated pCPUs, w/o mwait
> > exposed, adaptive advance lapic timer is default -1). The ratio of occurrences:
> >
> > greater_than_10k/total: 29/2060, 1.41%
> >
> > 2. cyclictest in the guest (88 vCPUs, not bound, w/o mwait exposed, adaptive
> > advance lapic timer is default -1) and stress in the host (no isolation). The
> > ratio of occurrences:
> >
> > greater_than_10k/total: 122381/1017363, 12.03%
>
> Hmm, I'm inclined to say this is working as intended.  If the vCPU isn't affined
> and/or it's getting preempted, then large spikes are expected, and not adjusting
> in reaction to those spikes is desirable.  E.g. adjusting by 20k cycles because
> the timer happened to expire while a vCPU was preempted will cause KVM to busy
> wait for quite a long time if the next timer runs without interference, and then
> KVM will thrash the advancement.
>
> And I don't really see the point in pushing the max adjustment beyond 10k.  The
> max _advancement_ is 5000ns, which means that even with a blazing fast 5.0GHz
> system, a max adjustment of 1250 (10k / 8, the step divisor) should get KVM to
> the 25000 cycle advancement limit relatively quickly.  Since KVM resets to the
> initial 1000ns advancement when it would exceed the 5000ns max, I suspect that
> raising the max adjustment much beyond 10k cycles would quickly push a vCPU to
> the max, cause it to reset, and rinse and repeat.
>
> Note, we definitely don't want to raise the 5000ns max, as waiting with IRQs
> disabled for any longer than that will likely cause system instability.

I see. Thanks for your explanation.

--
Haiwei Li
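
For anyone following the arithmetic above, here is a rough userspace model of the tuning loop as it is described in this thread. It is only a sketch, not the kernel code: the 10k cycle limit, the step divisor of 8, and the 1000ns/5000ns advancement values come from the discussion, while the function names, the 100-cycle lower bound, the sign convention (positive delta means the timer fired late), and the clamp at zero are assumptions made for illustration.

```python
# Userspace model of the adjustment arithmetic discussed in the thread.
# This is NOT kernel code; names and some constants are illustrative.

ADJUST_MAX_CYCLES = 10_000  # "10k cycle limit" from the thread
ADJUST_MIN_CYCLES = 100     # assumption: ignore tiny fluctuations below this
ADJUST_STEP = 8             # "the step divisor" from the thread
ADVANCE_NS_INIT = 1_000     # initial advancement from the thread
ADVANCE_NS_MAX = 5_000      # max advancement from the thread


def cycles_to_ns(cycles: int, tsc_khz: int) -> int:
    """Convert TSC cycles to nanoseconds for a TSC running at tsc_khz."""
    return cycles * 1_000_000 // tsc_khz


def adjust_advance(advance_ns: int, delta_cycles: int, tsc_khz: int) -> int:
    """Return the advancement (ns) after observing one expiry delta.

    Assumption: delta_cycles > 0 means the timer fired late, < 0 means early.
    """
    magnitude = abs(delta_cycles)
    # Large spikes (e.g. a preempted vCPU) and tiny jitter are both ignored.
    if magnitude > ADJUST_MAX_CYCLES or magnitude < ADJUST_MIN_CYCLES:
        return advance_ns

    # Only 1/ADJUST_STEP of the observed error is applied per adjustment.
    step_ns = cycles_to_ns(magnitude, tsc_khz) // ADJUST_STEP
    if delta_cycles < 0:
        # Clamp at zero: a modelling choice, since this sketch uses plain ints.
        advance_ns = max(advance_ns - step_ns, 0)
    else:
        advance_ns += step_ns

    # Exceeding the max resets to the initial value, as described above.
    if advance_ns > ADVANCE_NS_MAX:
        advance_ns = ADVANCE_NS_INIT
    return advance_ns


if __name__ == "__main__":
    tsc_khz = 5_000_000  # the 5.0GHz example from the thread
    advance = ADVANCE_NS_INIT
    for step in range(1, 18):
        advance = adjust_advance(advance, ADJUST_MAX_CYCLES, tsc_khz)
        print(f"step {step:2d}: advance = {advance} ns")
```

In this model, a run of deltas right at the 10k limit on a 5.0GHz TSC moves the advancement by 250ns (1250 cycles) per step, so it climbs from 1000ns to the 5000ns max in 16 steps and resets on the next one, while a 20k cycle spike leaves the advancement unchanged.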