Hi Andrea,

On Thu, Jun 06, 2019 at 01:51:03PM -0400, Andrea Arcangeli wrote:
> Hello,
> 
> On Mon, Jun 03, 2019 at 07:52:43PM -0300, Marcelo Tosatti wrote:
> > +unsigned int guest_halt_poll_ns = 200000;
> > +module_param(guest_halt_poll_ns, uint, 0644);
> > +
> > +/* division factor to shrink halt_poll_ns */
> > +unsigned int guest_halt_poll_shrink = 2;
> > +module_param(guest_halt_poll_shrink, uint, 0644);
> > +
> > +/* multiplication factor to grow per-cpu halt_poll_ns */
> > +unsigned int guest_halt_poll_grow = 2;
> > +module_param(guest_halt_poll_grow, uint, 0644);
> > +
> > +/* value in ns to start growing per-cpu halt_poll_ns */
> > +unsigned int guest_halt_poll_grow_start = 10000;
> > +module_param(guest_halt_poll_grow_start, uint, 0644);
> > +
> > +/* value in ns to start growing per-cpu halt_poll_ns */
> > +bool guest_halt_poll_allow_shrink = true;
> > +module_param(guest_halt_poll_allow_shrink, bool, 0644);
> 
> These variables can all be static. They should also be __read_mostly,
> to make sure we don't unnecessarily hit false sharing while going
> idle.

Fixed.

> > +	while (!need_resched()) {
> > +		cpu_relax();
> > +		now = ktime_get();
> > +
> > +		if (!ktime_before(now, end_spin)) {
> > +			do_halt = 1;
> > +			break;
> > +		}
> > +	}
> 
> On Skylake, pause takes ~75 cycles with ple_gap=0 (and Marcelo found
> it takes 6 cycles with pause loop exiting enabled, but that shall be
> fixed in the CPU and we can ignore it).

Right, that is a generic problem.

> So we could call ktime_get() only once every 100 iterations or more
> and we'd still be accurate down to at least 1usec.
> 
> Ideally we'd like a ktime_try_get() that will break the seqcount loop
> if read_seqcount_retry fails. Something like the pseudocode below:
> 
> #define KTIME_ERR ((ktime_t) { .tv64 = 0 })
> 
> ktime_t ktime_try_get(void)
> {
> 	[..]
> 	seq = read_seqcount_begin(&timekeeper_seq);
> 	secs = tk->xtime_sec + tk->wall_to_monotonic.tv_sec;
> 	nsecs = timekeeping_get_ns(&tk->tkr_mono) +
> 		tk->wall_to_monotonic.tv_nsec;
> 	if (unlikely(read_seqcount_retry(&timekeeper_seq, seq)))
> 		return KTIME_ERR;
> 	[..]
> }
> 
> If ktime_try_get() fails, we keep calling it on every iteration of
> the loop; once it finally succeeds, we call it again only after 100
> or more pause instructions. So we continue polling need_resched()
> while we wait for timekeeper_seq to be released (and hopefully, by
> looping 100 times or more, we reduce how often we find timekeeper_seq
> locked).
> 
> All we care about is reacting to need_resched ASAP and having a
> resolution on the order of 1usec for the spin time.
> 
> If 100 is too hardwired, a new module parameter in the __read_mostly
> section, defaulting to 100 or more, sounds preferable to hitting the
> timekeeper_seq seqcount cacheline every ~75nsec.

I'll switch to the cheaper sched_clock(), as suggested by Peter.

Thanks!

> I doubt it'd make any measurable difference with a few vcpus, but
> with hundreds of host CPUs and vcpus perhaps it's worth it.
> 
> This can of course be done later, once the patch is merged and if
> it's confirmed that the above makes sense in practice and not just in
> theory. I wouldn't want to delay the merge for a possible
> micro-optimization.
> 
> Reviewed-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> 
> Thanks,
> Andrea
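
For reference, a minimal sketch (not the actual v2 patch) of how the two
points above could combine: static __read_mostly module parameters plus a
poll loop that spins on need_resched() and reads sched_clock() only once
every N pause iterations. The guest_halt_poll_check_interval knob, its
default of 100, and the guest_halt_poll_expired() helper name are made up
here for illustration only.

#include <linux/cache.h>
#include <linux/moduleparam.h>
#include <linux/sched.h>	/* need_resched() */
#include <linux/sched/clock.h>	/* sched_clock() */
#include <asm/processor.h>	/* cpu_relax() */

/* static + __read_mostly as suggested above; default value from the patch */
static unsigned int guest_halt_poll_ns __read_mostly = 200000;
module_param(guest_halt_poll_ns, uint, 0644);

/*
 * Hypothetical knob along the lines Andrea suggested: how many pause
 * iterations to spin between clock reads.  Name and default (100) are
 * illustrative, not taken from the actual patch.
 */
static unsigned int guest_halt_poll_check_interval __read_mostly = 100;
module_param(guest_halt_poll_check_interval, uint, 0644);

/*
 * Returns true if the poll window expired without need_resched()
 * becoming set, i.e. the caller should go ahead and halt.
 */
static bool guest_halt_poll_expired(void)
{
	u64 start = sched_clock();	/* cheaper than ktime_get() */
	unsigned int iters = 0;
	bool do_halt = false;

	while (!need_resched()) {
		cpu_relax();

		/* Touch the clock only once per check interval. */
		if (++iters < guest_halt_poll_check_interval)
			continue;
		iters = 0;

		if (sched_clock() - start >= guest_halt_poll_ns) {
			do_halt = true;
			break;
		}
	}
	return do_halt;
}

The caller would issue the actual halt only when guest_halt_poll_expired()
returns true. Note that with sched_clock() the ktime_try_get() idea becomes
moot: at least on x86, sched_clock() is a TSC read plus a per-cpu scaling
step, so it does not contend on the timekeeper_seq cacheline, and batching
the reads is then purely about saving the per-call cycles.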