On Fri, 17 Nov 2017, Quan Xu wrote:
> On 2017-11-16 17:53, Thomas Gleixner wrote:
> > That's just plain wrong. We don't want to see any of this PARAVIRT crap
> > in anything outside the architecture/hypervisor interfacing code which
> > really needs it.
> >
> > The problem can and must be solved at the generic level in the first
> > place to gather the data which can be used to make such decisions.
> >
> > How that information is used might be either completely generic or
> > requires system specific variants. But as long as we don't have any
> > information at all we cannot discuss that.
> >
> > Please sit down and write up which data needs to be considered to make
> > decisions about probabilistic polling. Then we need to compare and
> > contrast that with the data which is necessary to make power/idle state
> > decisions.
> >
> > I would be very surprised if this data would not overlap by at least 90%.
>
> 1. Which data needs to be considered to make decisions about probabilistic
>    polling
>
> I really need to write up which data needs to be considered to make
> decisions about probabilistic polling. Over the last several months I have
> focused on the data _from idle to reschedule_, in order to bypass the idle
> loops. Unfortunately, this makes me touch scheduler/idle/nohz code
> inevitably.
>
> Following tglx's suggestion, the data which is necessary to make power/idle
> state decisions is the last idle state's residency time. IIUC this data is
> the duration from idle to wakeup, where the wakeup may be caused by the
> reschedule IRQ or some other IRQ.

That's part of the picture, but not complete.

> I have also tested that the reschedule IRQ overlaps by more than 90%
> (tracing the need_resched status after cpuidle_idle_call()) when I run
> ctxsw/netperf for one minute.
>
> Given that overlap, I think I can use the last idle state's residency time
> as input to make decisions about probabilistic polling, as
> @dev->last_residency does. It is much easier to get this data.

That's only true for your particular use case.

> 2. Do a HV-specific idle driver (function)
>
> So far, power management is not exposed to the guest. Idle is simple for a
> KVM guest: calling "sti" / "hlt" (cpuidle_idle_call() -->
> default_idle_call()). Thanks to the Xen guys, who have implemented the
> paravirt framework, I can implement it as easily as the following:
>
> --- a/arch/x86/kernel/kvm.c

Your email client is using very strange formatting.

> +++ b/arch/x86/kernel/kvm.c
> @@ -465,6 +465,12 @@ static void __init kvm_apf_trap_init(void)
>  	update_intr_gate(X86_TRAP_PF, async_page_fault);
>  }
>
> +static __cpuidle void kvm_safe_halt(void)
> +{
> +	/* 1. POLL, if need_resched() --> return */
> +
> +	asm volatile("sti; hlt": : :"memory");	/* 2. halt */
> +
> +	/* 3. get the last idle state's residency time */
> +
> +	/* 4. update poll duration based on last idle state's residency time */
> +}
> +
>  void __init kvm_guest_init(void)
>  {
>  	int i;
> @@ -490,6 +496,8 @@ void __init kvm_guest_init(void)
>  	if (kvmclock_vsyscall)
>  		kvm_setup_vsyscall_timeinfo();
>
> +	pv_irq_ops.safe_halt = kvm_safe_halt;
> +
>  #ifdef CONFIG_SMP
>
> Then I do not need to introduce a new pvops, and never have to modify
> scheduler/idle/nohz code again. I can also narrow all of the code down to
> arch/x86/kernel/kvm.c.
>
> If this is the right direction, I will send a new patch set next week.

This is definitely better than what you proposed so far and implementing it
as a proof of concept seems to be worthwhile.

But I doubt that this is the final solution. It's not generic and not
necessarily suitable for all use case scenarios.
Thanks,

	tglx
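[Editor's note: for reference, a fleshed-out version of the kvm_safe_halt()
outline quoted above might look like the sketch below. It is uncompiled and
not from the original thread: the per-CPU poll_duration_ns variable and the
POLL_START_NS / POLL_MAX_NS / POLL_GROW_CUTOFF_NS constants are hypothetical,
loosely modeled on the host-side halt_poll_ns grow/shrink heuristic, and it
assumes the usual kvm.c includes plus linux/percpu.h, linux/timekeeping.h and
linux/sched.h.]

	#define POLL_START_NS		10000ULL   /* first non-zero window */
	#define POLL_MAX_NS		500000ULL  /* cap on the poll window */
	#define POLL_GROW_CUTOFF_NS	200000ULL  /* "short sleep" threshold */

	static DEFINE_PER_CPU(u64, poll_duration_ns);

	static __cpuidle void kvm_safe_halt(void)
	{
		u64 poll = __this_cpu_read(poll_duration_ns);
		u64 start, residency;

		/*
		 * 1. Poll before halting. need_resched() can become true
		 *    here even with interrupts disabled, because the remote
		 *    CPU sets TIF_NEED_RESCHED before it sends the
		 *    reschedule IPI, so the flag is visible to the polling
		 *    loop before the interrupt is delivered.
		 */
		start = ktime_get_ns();
		while (ktime_get_ns() - start < poll) {
			if (need_resched()) {
				raw_local_irq_enable();
				return;
			}
			cpu_relax();
		}

		/* 2. Nothing showed up while polling, halt for real. */
		start = ktime_get_ns();
		asm volatile("sti; hlt" : : : "memory");

		/* 3. Residency of this idle period, from halt to wakeup. */
		residency = ktime_get_ns() - start;

		/*
		 * 4. A short residency suggests polling a bit longer would
		 *    have avoided the halt: grow the window. A long residency
		 *    means polling only burns cycles: shrink it.
		 */
		if (residency < POLL_GROW_CUTOFF_NS)
			poll = poll ? min(poll * 2, POLL_MAX_NS) : POLL_START_NS;
		else
			poll /= 2;
		__this_cpu_write(poll_duration_ns, poll);
	}

[The early raw_local_irq_enable() on the polling exit path is there to
preserve the arch_safe_halt() contract of returning with interrupts enabled;
whether the grow/shrink constants and this particular policy are sensible is
exactly what the proof of concept would have to measure.]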