On Fri, Apr 05, 2024 at 07:42:35AM -0700, Sean Christopherson wrote:
> On Fri, Apr 05, 2024, Marcelo Tosatti wrote:
> > On Mon, Apr 01, 2024 at 01:21:25PM -0700, Sean Christopherson wrote:
> > > On Thu, Mar 28, 2024, Leonardo Bras wrote:
> > > > I am dealing with a latency issue inside a KVM guest, which is caused by
> > > > a sched_switch to rcuc[1].
> > > >
> > > > During guest entry, kernel code will signal to RCU that the current CPU was in
> > > > a quiescent state, making sure no other CPU is waiting for this one.
> > > >
> > > > If a vcpu just stopped running (guest_exit), and a synchronize_rcu() was
> > > > issued somewhere since guest entry, there is a chance a timer interrupt
> > > > will happen on that CPU, which will cause rcu_sched_clock_irq() to run.
> > > >
> > > > rcu_sched_clock_irq() will check rcu_pending(), which will return true,
> > > > and cause invoke_rcu_core() to be called, which will (in the current config)
> > > > cause rcuc/N to be scheduled onto the current cpu.
> > > >
> > > > In rcu_pending(), I noticed we can avoid returning true (and thus invoking
> > > > rcu_core()) if the current cpu is nohz_full, and the cpu came from either
> > > > idle or userspace, since both are considered quiescent states.
> > > >
> > > > Since this is also true of guest context, my idea is to solve this latency
> > > > issue by avoiding rcu_core() invocation if the CPU was running a guest vcpu.
> > > >
> > > > On the other hand, I could not find a way of reliably saying the current
> > > > cpu was running a guest vcpu, so patch #1 implements a per-cpu variable
> > > > for keeping the time (jiffies) of the last guest exit.
> > > >
> > > > In patch #2 I compare the current time to that time, and if less than a second
> > > > has passed, we just skip rcu_core() invocation, since there is a high chance
> > > > it will just go back to the guest in a moment.
> > >
> > > What's the downside if there's a false positive?
> > rcuc wakes up (which might exceed the allowed latency threshold
> > for certain realtime apps).
>
> Isn't that a false negative? (RCU doesn't detect that a CPU is about to (re)enter
> a guest)  I was trying to ask about the case where RCU thinks a CPU is about to
> enter a guest, but the CPU never does (at least, not in the immediate future).
>
> Or am I just not understanding how RCU's kthreads work?

It is quite possible that the current rcu_pending() code needs help,
given the possibility of vCPU preemption.  I have heard of people doing
nested KVM virtualization -- or is that no longer a thing?

But the help might well involve RCU telling the hypervisor that a given
vCPU needs to run.  Not sure how that would go over, though it has been
prototyped a couple of times in the context of RCU priority boosting.

> > > > What I know is weird with this patch:
> > > > 1 - Not sure if this is the best way of finding out if the cpu was
> > > >     running a guest recently.
> > > >
> > > > 2 - This per-cpu variable needs to get set at each guest_exit(), so it's
> > > >     overhead, even though it's supposed to be in local cache. If that's
> > > >     an issue, I would suggest having this part compiled out on
> > > >     !CONFIG_NO_HZ_FULL, but further checking each cpu for being nohz_full
> > > >     enabled seems more expensive than just setting this out.
> > >
> > > A per-CPU write isn't problematic, but I suspect reading jiffies will be quite
> > > imprecise, e.g. it'll be a full tick "behind" on many exits.
> > >
> > > > 3 - It checks if the guest exit happened more than 1 second ago. This 1
> > > >     second value was copied from rcu_nohz_full_cpu(), which checks if the
> > > >     grace period started more than a second ago. If this value is bad,
> > > >     I have no issue changing it.
> > >
> > > IMO, checking if a CPU "recently" ran a KVM vCPU is a suboptimal heuristic regardless
> > > of what magic time threshold is used.
> >
> > Why? It works for this particular purpose.
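As an aside for readers following the thread, the jiffies-based heuristic from patches #1 and #2 can be modeled outside the kernel. The sketch below is illustrative only: the model_* names and MODEL_HZ are hypothetical stand-ins (not the actual kernel symbols), a plain unsigned long stands in for jiffies, and model_time_before() mimics the kernel's wrap-safe time_before() idiom.

```c
#include <assert.h>
#include <stdbool.h>

/* One jiffy per millisecond, as on an HZ=1000 kernel (illustrative). */
#define MODEL_HZ 1000UL

/* Patch #1 (modeled): per-cpu timestamp written at each guest_exit(). */
unsigned long last_guest_exit_jiffies;

void model_guest_exit(unsigned long now)
{
        last_guest_exit_jiffies = now;
}

/* Wrap-safe time comparison in the style of the kernel's time_before():
 * true if a is before b, even across an unsigned wraparound. */
bool model_time_before(unsigned long a, unsigned long b)
{
        return (long)(a - b) < 0;
}

/* Patch #2 (modeled): skip invoke_rcu_core() if a guest exit happened
 * less than one second (MODEL_HZ jiffies) ago, on the theory that the
 * CPU will re-enter the guest (a quiescent state) in a moment. */
bool model_recent_guest_exit(unsigned long now)
{
        return model_time_before(now, last_guest_exit_jiffies + MODEL_HZ);
}
```

Whatever threshold eventually replaces the 1-second value, the wrap-safe comparison idiom stays the same; only the MODEL_HZ bound changes.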
>
> Because maintaining magic numbers is no fun, AFAICT the heuristic doesn't guard
> against edge cases, and I'm pretty sure we can do better with about the same amount
> of effort/churn.

Beyond a certain point, we have no choice.  How long should RCU let
a CPU run with preemption disabled before complaining?  We choose 21
seconds in mainline and some distros choose 60 seconds.  Android chooses
20 milliseconds for synchronize_rcu_expedited() grace periods.

> > > IIUC, what you want is a way to detect if a CPU is likely to _run_ a KVM
> > > vCPU in the near future.  KVM can provide that information with much better
> > > precision, e.g. KVM knows when it's in the core vCPU run loop.
> >
> > ktime_t ktime_get(void)
> > {
> >         struct timekeeper *tk = &tk_core.timekeeper;
> >         unsigned int seq;
> >         ktime_t base;
> >         u64 nsecs;
> >
> >         WARN_ON(timekeeping_suspended);
> >
> >         do {
> >                 seq = read_seqcount_begin(&tk_core.seq);
> >                 base = tk->tkr_mono.base;
> >                 nsecs = timekeeping_get_ns(&tk->tkr_mono);
> >
> >         } while (read_seqcount_retry(&tk_core.seq, seq));
> >
> >         return ktime_add_ns(base, nsecs);
> > }
> > EXPORT_SYMBOL_GPL(ktime_get);
> >
> > ktime_get() is more expensive than an unsigned long assignment.
>
> Huh?  What does ktime_get() have to do with anything?  I'm suggesting something
> like the below (wants_to_run is from an in-flight patch,
> https://lore.kernel.org/all/20240307163541.92138-1-dmatlack@xxxxxxxxxx).

Interesting.  Some questions below, especially if we are doing nested
virtualization.
> ---
>  include/linux/context_tracking.h       | 12 ++++++++++++
>  include/linux/context_tracking_state.h |  3 +++
>  kernel/rcu/tree.c                      |  9 +++++++--
>  virt/kvm/kvm_main.c                    |  7 +++++++
>  4 files changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 6e76b9dba00e..59bc855701c5 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -86,6 +86,16 @@ static __always_inline void context_tracking_guest_exit(void)
>  		__ct_user_exit(CONTEXT_GUEST);
>  }
>  
> +static inline void context_tracking_guest_start_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, true);
> +}
> +
> +static inline void context_tracking_guest_stop_run_loop(void)
> +{
> +	__this_cpu_write(context_tracking.in_guest_run_loop, false);
> +}
> +
>  #define CT_WARN_ON(cond) WARN_ON(context_tracking_enabled() && (cond))
>  
>  #else
> @@ -99,6 +109,8 @@ static inline int ct_state(void) { return -1; }
>  static inline int __ct_state(void) { return -1; }
>  static __always_inline bool context_tracking_guest_enter(void) { return false; }
>  static __always_inline void context_tracking_guest_exit(void) { }
> +static inline void context_tracking_guest_start_run_loop(void) { }
> +static inline void context_tracking_guest_stop_run_loop(void) { }
>  #define CT_WARN_ON(cond) do { } while (0)
>  #endif /* !CONFIG_CONTEXT_TRACKING_USER */
>  
> diff --git a/include/linux/context_tracking_state.h b/include/linux/context_tracking_state.h
> index bbff5f7f8803..629ada1a4d81 100644
> --- a/include/linux/context_tracking_state.h
> +++ b/include/linux/context_tracking_state.h
> @@ -25,6 +25,9 @@ enum ctx_state {
>  #define CT_DYNTICKS_MASK	(~CT_STATE_MASK)
>  
>  struct context_tracking {
> +#if IS_ENABLED(CONFIG_KVM)
> +	bool in_guest_run_loop;
> +#endif
>  #ifdef CONFIG_CONTEXT_TRACKING_USER
>  	/*
>  	 * When active is false, probes are unset in order
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d9642dd06c25..303ae9ae1c53 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3937,8 +3937,13 @@ static int rcu_pending(int user)
>  	if (rcu_nocb_need_deferred_wakeup(rdp, RCU_NOCB_WAKE))
>  		return 1;
>  
> -	/* Is this a nohz_full CPU in userspace or idle?  (Ignore RCU if so.) */
> -	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
> +	/*
> +	 * Is this a nohz_full CPU in userspace, idle, or likely to enter a
> +	 * guest in the near future?  (Ignore RCU if so.)
> +	 */
> +	if ((user || rcu_is_cpu_rrupt_from_idle() ||
> +	     __this_cpu_read(context_tracking.in_guest_run_loop)) &&

In the case of (user || rcu_is_cpu_rrupt_from_idle()), this CPU was in
a quiescent state just before the current scheduling-clock interrupt
and will again be in a quiescent state right after return from this
interrupt.  This means that the grace-period kthread will be able to
remotely sense this quiescent state, so that the current CPU need do
nothing.

In contrast, it looks like context_tracking.in_guest_run_loop instead
means that when we return from this interrupt, this CPU will still be
in a non-quiescent state.

Now, in the nested-virtualization case, your point might be that the
lower-level hypervisor could preempt the vCPU in the interrupt handler
just as easily as in the .in_guest_run_loop code.  Which is a good
point.  But I don't know of a way to handle this other than heuristics
and maybe hinting to the hypervisor (which has been prototyped for RCU
priority boosting).

Maybe the time for such hinting has come?

> +	     rcu_nohz_full_cpu())

And rcu_nohz_full_cpu() has a one-second timeout, and has for quite
some time.

>  		return 0;
>  
>  	/* Is the RCU core waiting for a quiescent state from this CPU? */
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index bfb2b52a1416..5a7efc669a0f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -209,6 +209,9 @@ void vcpu_load(struct kvm_vcpu *vcpu)
>  {
>  	int cpu = get_cpu();
>  
> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_start_run_loop();

At this point, if this is a nohz_full CPU, it will no longer report
quiescent states until the grace period is at least one second old.

> +
>  	__this_cpu_write(kvm_running_vcpu, vcpu);
>  	preempt_notifier_register(&vcpu->preempt_notifier);
>  	kvm_arch_vcpu_load(vcpu, cpu);
> @@ -222,6 +225,10 @@ void vcpu_put(struct kvm_vcpu *vcpu)
>  	kvm_arch_vcpu_put(vcpu);
>  	preempt_notifier_unregister(&vcpu->preempt_notifier);
>  	__this_cpu_write(kvm_running_vcpu, NULL);
> +

And also at this point, if this is a nohz_full CPU, it will no longer
report quiescent states until the grace period is at least one second
old.

> +	if (vcpu->wants_to_run)
> +		context_tracking_guest_stop_run_loop();
> +
>  	preempt_enable();
>  }
>  EXPORT_SYMBOL_GPL(vcpu_put);
>
> base-commit: 619e56a3810c88b8d16d7b9553932ad05f0d4968

All of which might be OK.  Just checking as to whether all of that was
in fact the intent.

							Thanx, Paul
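To summarize the proposed flow, here is a minimal userspace model of the modified rcu_pending() condition together with the vcpu_load()/vcpu_put() hooks from the diff. Everything here is an illustrative stand-in: the model_* names are hypothetical, and the boolean parameters replace the real rcu_pending() inputs and the rcu_nohz_full_cpu() check.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the proposed context_tracking.in_guest_run_loop percpu
 * flag, set in vcpu_load() and cleared in vcpu_put(). */
bool in_guest_run_loop;

/* Models the vcpu_load() hunk: only mark the run loop if the caller
 * actually intends to run the vCPU (the wants_to_run condition). */
void model_vcpu_load(bool wants_to_run)
{
        if (wants_to_run)
                in_guest_run_loop = true;
}

/* Models the vcpu_put() hunk: clear the flag on the way out. */
void model_vcpu_put(bool wants_to_run)
{
        if (wants_to_run)
                in_guest_run_loop = false;
}

/* Models the modified rcu_pending() test: ignore RCU on a nohz_full CPU
 * that is in userspace, idle, or inside the vCPU run loop.
 * nohz_full_quiet stands in for rcu_nohz_full_cpu(), which also fails
 * (returns false) once the grace period is about one second old. */
bool model_rcu_ignore_cpu(bool user, bool from_idle, bool nohz_full_quiet)
{
        return (user || from_idle || in_guest_run_loop) && nohz_full_quiet;
}
```

In this model, nohz_full_quiet going false corresponds to rcu_nohz_full_cpu() timing out, so even a CPU that stays in the run loop stops being ignored once the grace period is about a second old, which is the behavior noted above for vcpu_load() on a nohz_full CPU.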