Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

Leonardo Bras <leobras@xxxxxxxxxx> · Wed, 8 May 2024 03:19:01 -0300

On Tue, May 07, 2024 at 08:22:42PM -0700, Paul E. McKenney wrote:
> On Tue, May 07, 2024 at 11:51:15PM -0300, Leonardo Bras wrote:
> > On Tue, May 07, 2024 at 05:08:54PM -0700, Sean Christopherson wrote:
> > > On Tue, May 07, 2024, Sean Christopherson wrote:
> > > > On Tue, May 07, 2024, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > > > But if we do need RCU to be more aggressive about treating guest execution as
> > > > > an RCU quiescent state within the host, that additional check would be an
> > > > > excellent way of making that happen.
> > > > 
> > > > It's not clear to me that being more aggressive is warranted.  If my understanding
> > > > of the existing @user check is correct, we _could_ achieve similar functionality
> > > > for vCPU tasks by defining a rule that KVM must never enter an RCU critical section
> > > > with PF_VCPU set and IRQs enabled, and then rcu_pending() could check PF_VCPU.
> > > > On x86, this would be relatively straightforward (hack-a-patch below), but I've
> > > > no idea what it would look like on other architectures.
> > > > 
> > > > But the value added isn't entirely clear to me, probably because I'm still missing
> > > > something.  KVM will have *very* recently called __ct_user_exit(CONTEXT_GUEST) to
> > > > note the transition from guest to host kernel.  Why isn't that a sufficient hook
> > > > for RCU to infer grace period completion?
> > 
> > This is one of the solutions I tested when I was trying to solve the bug:
> > - Report quiescent state both in guest entry & guest exit.
> > 
> > It improves the bug, but has 2 issues compared to the timing alternative:
> > 1 - Saving jiffies to a per-cpu local variable is usually cheaper than 
> >     reporting a quiescent state
> > 2 - If we report it on guest_exit() and some other cpu requests a grace 
> >     period in the next few cpu cycles, there is a chance a timer interrupt 
> >     can trigger rcu_core() before the next guest_entry, which would 
> >     introduce unnecessary latency, and cause the issue we are trying to 
> >     fix.
> > 
> > I mean, it makes the bug reproduce less often, but does not fix it.
> 
> OK, then it sounds like something might be needed, but again, I must
> defer to you guys on the need.
> 
> If there is a need, what are your thoughts on the approach that Sean
> suggested?

Something just hit me, and maybe I need to propose something more generic.

But I need some help with a question first:
- Let's forget about kvm for a few seconds, and focus on host userspace:
  Suppose we have a high priority (user) task running on a nohz_full cpu,
  and it gets interrupted (by an IRQ, let's say). Is it possible that the
  interrupting task gets interrupted in turn by the timer interrupt, which
  will check rcu_pending() and have it return true? (1)
  (Or is there any protection against that kind of scenario?) (2)

1)
If there is any possibility of this happening, maybe we could consider 
fixing it by adding some kind of generic timeout in RCU code, to be used 
on nohz_full cpus, so that it keeps track of the last time a quiescent 
state was reported on this cpu, and rcu_pending() returns false if one 
happened in the last N jiffies.

In this case, we could also report a quiescent state in guest_exit, and 
make use of the above generic RCU timeout to avoid having any rcu_core() 
running in those switching moments.
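
To make idea (1) concrete, here is a rough userspace model of the check, not
kernel code and not a proposed patch. Every sim_* name, the per-cpu struct
layout and the threshold value are made up for illustration. It only shows
the intended ordering: a quiescent state is recorded, a grace period is then
requested by someone else, and the rcu_pending()-like check stays false until
N jiffies have passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model only: none of these names exist in the kernel. */
#define SIM_NR_CPUS    4
#define SIM_QS_TIMEOUT 2UL   /* the "N jiffies" window; arbitrary value */

struct sim_cpu {
	unsigned long last_qs;  /* jiffies when the last quiescent state was noted */
	bool qs_seen;           /* false until the first quiescent state */
	bool gp_needs_qs;       /* a grace period is waiting on this cpu */
};

static struct sim_cpu sim_cpus[SIM_NR_CPUS];

/* Record a quiescent state, e.g. at a guest_exit-like point. */
void sim_note_qs(int cpu, unsigned long now)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	sim_cpus[cpu].last_qs = now;
	sim_cpus[cpu].qs_seen = true;
	sim_cpus[cpu].gp_needs_qs = false;  /* satisfies any earlier request */
}

/* Another cpu starts a grace period that needs a report from 'cpu'. */
void sim_request_gp(int cpu)
{
	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	sim_cpus[cpu].gp_needs_qs = true;
}

/* Model of the timer-tick check: should rcu_core()-like work run now? */
bool sim_rcu_pending(int cpu, unsigned long now)
{
	struct sim_cpu *c;

	assert(cpu >= 0 && cpu < SIM_NR_CPUS);
	c = &sim_cpus[cpu];

	if (!c->gp_needs_qs)
		return false;
	/* Unsigned subtraction keeps this correct across jiffies wraparound. */
	if (c->qs_seen && now - c->last_qs < SIM_QS_TIMEOUT)
		return false;  /* recent quiescent state: defer the work */
	return true;
}
```

In this model, a quiescent state noted at jiffy 100 followed by a grace
period request keeps sim_rcu_pending() false at jiffy 101 and lets it return
true from jiffy 102 on. Whether the comparison should be "<" or "<=", and
what N should be, are open questions the sketch does not try to answer.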

2)
On the other hand, if there are mechanisms in place for avoiding such a 
scenario, it could justify adding some similar mechanism to KVM guest_exit 
/ guest_entry. If adding such a mechanism is hard or expensive, we 
could use the KVM-only timeout previously suggested to avoid what we are 
currently hitting.

Could we use both a timeout & context tracking in this scenario? Yes.
But why do that, if the timeout alone would work just as well?

If I missed something, please let me know. :)

Thanks!
Leo