On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > > a sched_switch to rcuc[1].
> > > >
> > > > During guest entry, kernel code will signal to RCU that the current CPU was in
> > > > a quiescent state, making sure no other CPU is waiting for this one.
> > > >
> > > > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > > will happen on that CPU, which will cause rcu_sched_clock_irq() to run.
> > > >
> > > > rcu_sched_clock_irq() will check rcu_pending(), which will return true,
> > > > and cause invoke_rcu_core() to be called, which will (in the current config)
> > > > cause rcuc/N to be scheduled onto the current cpu.
> > > >
> > > > In rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > > idle or userspace, since both are considered quiescent states.
> > > >
> > > > Since this is also true of guest context, my idea is to solve this latency
> > > > issue by avoiding rcu_core() invocation if the CPU was running a guest vcpu.
> > > >
> > > > On the other hand, I could not find a way of reliably saying the current
> > > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > > for keeping the time (jiffies) of the last guest exit.
> > > >
> > > > In patch #2 I compare the current time to that time, and if less than a second
> > > > has passed, we just skip rcu_core() invocation, since there is a high chance
> > > > it will just go back to the guest in a moment.
> > >
> > > What's the downside if there's a false positive?
> > rcuc wakes up (which might exceed the allowed latency threshold
> > for certain realtime apps).
>
> Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> enter a guest, but the CPU never does (at least, not in the immediate future).
>
> Or am I just not understanding how RCU's kthreads work?

It is quite possible that the current rcu_pending() code needs help,
given the possibility of vCPU preemption.  I have heard of people doing
nested KVM virtualization -- or is that no longer a thing?

But the help might well involve RCU telling the hypervisor that a given
vCPU needs to run.  Not sure how that would go over, though it has been
prototyped a couple of times in the context of RCU priority boosting.

> > > > What I know is weird with this patch:
> > > > 1 - Not sure if this is the best way of finding out if the cpu was
> > > >     running a guest recently.
> > > >
> > > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > > >     overhead, even though it's supposed to be in local cache. If that's
> > > >     an issue, I would suggest having this part compiled out on
> > > >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > > >     enabled seems more expensive than just setting this out.
> > >
> > > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > > imprecise, e.g. it'll be a full tick "behind" on many exits.
> > >
> > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > >     second value was copied from rcu_nohz_full_cpu(), which checks if the
> > > >     grace period started more than a second ago. If this value is bad,
> > > >     I have no issue changing it.
> > >
> > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > of what magic time threshold is used.
> >
> > Why? It works for this particular purpose.
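As an aside for readers following the thread, the jiffies-based heuristic from patches #1 and #2 can be modeled outside the kernel. The sketch below is illustrative only: the model_* names and MODEL_HZ are hypothetical stand-ins (not the actual kernel symbols), a plain unsigned long stands in for jiffies, and model_time_before() mimics the kernel's wrap-safe time_before() idiom.

```c
#include <assert.h>
#include <stdbool.h>

/* One jiffy per millisecond, as on an HZ=1000 kernel (illustrative). */
#define MODEL_HZ 1000UL

/* Patch #1 (modeled): per-cpu timestamp written at each guest_exit(). */
unsigned long last_guest_exit_jiffies;

void model_guest_exit(unsigned long now)
{
        last_guest_exit_jiffies = now;
}

/* Wrap-safe time comparison in the style of the kernel's time_before():
 * true if a is before b, even across an unsigned wraparound. */
bool model_time_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;
}

/* Patch #2 (modeled): skip invoke_rcu_core() if a guest exit happened
 * less than one second (MODEL_HZ jiffies) ago, on the theory that the
 * CPU will re-enter the guest (a quiescent state) in a moment. */
bool model_recent_guest_exit(unsigned long now)
{
        return model_time_before(now, last_guest_exit_jiffies + MODEL_HZ);
}
```

Whatever threshold eventually replaces the 1-second value, the wrap-safe comparison idiom stays the same; only the MODEL_HZ bound changes.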
>
> Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> against edge cases, and I'm pretty sure we can do better with about the same amount
> of effort/churn.

Beyond a certain point, we have no choice.  How long should RCU let
a CPU run with preemption disabled before complaining?  We choose 21
seconds in mainline and some distros choose 60 seconds.  Android chooses
20 milliseconds for synchronize_rcu_expedited() grace periods.

> > > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > > vCPU in the near future.  KVM can provide that information with much better
> > > precision, e.g. KVM knows when it's in the core vCPU run loop.
> >
> > ktime_t ktime_get(void)
> > {
> >         struct timekeeper *tk = &tk_core.timekeeper;
> >         unsigned int seq;
> >         ktime_t base;
> >         u64 nsecs;
> >
> >         WARN_ON(timekeeping_suspended);
> >
> >         do {
> >                 seq = read_seqcount_begin(&tk_core.seq);
> >                 base = tk->tkr_mono.base;
> >                 nsecs = timekeeping_get_ns(&tk->tkr_mono);
> >
> >         } while (read_seqcount_retry(&tk_core.seq, seq));
> >
> >         return ktime_add_ns(base, nsecs);
> > }
> > EXPORT_SYMBOL_GPL(ktime_get);
> >
> > ktime_get() is more expensive than an unsigned long assignment.
>
> Huh?  What does ktime_get() have to do with anything?  I'm suggesting something
> like the below (wants_to_run is from an in-flight patch,
> https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@xxxxxxxxxx).

Interesting.  Some questions below, especially if we are doing nested
virtualization.
> ---
>  include/linux/context_tracking.h       | 12 ++++++++++++
>  include/linux/context_tracking_state.h |  3 +++
>  kernel/rcu/tree.c                      |  9 +++++++--
>  virt/kvm/kvm_main.c                    |  7 +++++++
>  4 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..59bc855701c5 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
>  		__ct_user_exit(CONTEXT_GUEST);
>  }
>  
> +static inline void context_tracking_guest_start_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, true);
> +}
> +
> +static inline void context_tracking_guest_stop_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, false);
> +}
> +
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
> @@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
>  static __always_inline void context_tracking_guest_exit(void) { }
> +static inline void context_tracking_guest_start_run_loop(void) { }
> +static inline void context_tracking_guest_stop_run_loop(void) { }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index bbff5f7f8803..629ada1a4d81 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -25,6 +25,9 @@ enum ctx_state {
>  #define CT_DYNTICKS_MASK	(~CT_STATE_MASK)
>  
>  struct context_tracking {
> +#if IS_ENABLED(CONFIG_KVM)
> +	bool in_guest_run_loop;
> +#endif
>  #ifdef CONFIG_CONTEXT_TRACKING_USER
>  	/*
>  	 * When active is false, probes are unset in order
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..303ae9ae1c53 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
>  	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
>  		return 1;
>  
> -	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +	/*
> +	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> +	 * guest in the near future?  (Ignore RCU if so.)
> +	 */
> +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> +	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&

In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
a quiescent state just before the current scheduling-clock interrupt
and will again be in a quiescent state right after return from this
interrupt.  This means that the grace-period kthread will be able to
remotely sense this quiescent state, so that the current CPU need do
nothing.

In contrast, it looks like context_tracking.in_guest_run_loop instead
means that when we return from this interrupt, this CPU will still be
in a non-quiescent state.

Now, in the nested-virtualization case, your point might be that the
lower-level hypervisor could preempt the vCPU in the interrupt handler
just as easily as in the .in_guest_run_loop code.  Which is a good
point.  But I don't know of a way to handle this other than heuristics
and maybe hinting to the hypervisor (which has been prototyped for RCU
priority boosting).

Maybe the time for such hinting has come?

> +	     rcu_nohz_full_cpu())

And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
some time.

>  		return 0;
>  
>  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index bfb2b52a1416..5a7efc669a0f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>  {
>  	int cpu = get_cpu();
>  
> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_start_run_loop();

At this point, if this is a nohz_full CPU, it will no longer report
quiescent states until the grace period is at least one second old.

> +
>  	__this_cpu_write(kvm_running_vcpu, vcpu);
>  	preempt_notifier_register(&vcpu->preempt_notifier);
>  	kvm_arch_vcpu_load(vcpu, cpu);
> @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_arch_vcpu_put(vcpu);
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
>  	__this_cpu_write(kvm_running_vcpu, NULL);
> +

And also at this point, if this is a nohz_full CPU, it will no longer
report quiescent states until the grace period is at least one second
old.

> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_stop_run_loop();
> +
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(vcpu_put);
>
> base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968

All of which might be OK.  Just checking as to whether all of that was
in fact the intent.

							Thanx, Paul
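To summarize the proposed flow, here is a minimal userspace model of the modified rcu_pending() condition together with the vcpu_load()/vcpu_put() hooks from the diff. Everything here is an illustrative stand-in: the model_* names are hypothetical, and the boolean parameters replace the real rcu_pending() inputs and the rcu_nohz_full_cpu() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the proposed context_tracking.in_guest_run_loop percpu
 * flag, set in vcpu_load() and cleared in vcpu_put(). */
bool in_guest_run_loop;

/* Models the vcpu_load() hunk: only mark the run loop if the caller
 * actually intends to run the vCPU (the wants_to_run condition). */
void model_vcpu_load(bool wants_to_run)
{
        if (wants_to_run)
                in_guest_run_loop = true;
}

/* Models the vcpu_put() hunk: clear the flag on the way out. */
void model_vcpu_put(bool wants_to_run)
{
        if (wants_to_run)
                in_guest_run_loop = false;
}

/* Models the modified rcu_pending() test: ignore RCU on a nohz_full CPU
 * that is in userspace, idle, or inside the vCPU run loop.
 * nohz_full_quiet stands in for rcu_nohz_full_cpu(), which also fails
 * (returns false) once the grace period is about one second old. */
bool model_rcu_ignore_cpu(bool user, bool from_idle, bool nohz_full_quiet)
{
        return (user || from_idle || in_guest_run_loop) && nohz_full_quiet;
}
```

In this model, nohz_full_quiet going false corresponds to rcu_nohz_full_cpu() timing out, so even a CPU that stays in the run loop stops being ignored once the grace period is about a second old, which is the behavior noted above for vcpu_load() on a nohz_full CPU.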