Re: [RFC PATCH v1 0/2] Avoid rcu_core() if CPU just left guest vcpu

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 3 May 2024 14:29:57 -0700

On Fri, May 03, 2024, Leonardo Bras wrote:
> > KVM can provide that information with much better precision, e.g. KVM
> > knows when when it's in the core vCPU run loop.
> 
> That would not be enough.
> I need to present the application/problem to make a point:
> 
> - There is multiple  isolated physical CPU (nohz_full) on which we want to 
>   run KVM_RT vcpus, which will be running a real-time (low latency) task.
> - This task should not miss deadlines (RT), so we test the VM to make sure 
>   the maximum latency on a long run does not exceed the latency requirement
> - This vcpu will run on SCHED_FIFO, but has to run on lower priority than
>   rcuc, so we can avoid stalling other cpus.
> - There may be some scenarios where the vcpu will go back to userspace
>   (from KVM_RUN ioctl), and that does not mean it's good to interrupt the 
>   this to run other stuff (like rcuc).
>
> Now, I understand it will cover most of our issues if we have a context 
> tracking around the vcpu_run loop, since we can use that to decide not to 
> run rcuc on the cpu if the interruption hapenned inside the loop.
> 
> But IIUC we can have a thread that "just got out of the loop" getting 
> interrupted by the timer, and asked to run rcu_core which will be bad for 
> latency.
> 
> I understand that the chance may be statistically low, but happening once 
> may be enough to crush the latency numbers.
> 
> Now, I can't think on a place to put this context trackers in kvm code that 
> would avoid the chance of rcuc running improperly, that's why the suggested 
> timeout, even though its ugly.
> 
> About the false-positive, IIUC we could reduce it if we reset the per-cpu 
> last_guest_exit on kvm_put.

Which then opens up the window that you're trying to avoid (IRQ arriving just
after the vCPU is put, before the CPU exits to userspace).

If you want the "entry to guest is imminent" status to be preserved across an exit
to userspace, then it seems liek the flag really should be a property of the task,
not a property of the physical CPU.  Similar to how rcu_is_cpu_rrupt_from_idle()
detects that an idle task was interrupted, that goal is to detect if a vCPU task
was interrupted.

PF_VCPU is already "taken" for similar tracking, but if we want to track "this
task will soon enter an extended quiescent state", I don't see any reason to make
it specific to vCPU tasks.  Unless the kernel/KVM dynamically manages the flag,
which as above will create windows for false negatives, the kernel needs to
trust userspace to a certaine extent no matter what.  E.g. even if KVM sets a
PF_xxx flag on the first KVM_RUN, nothing would prevent userspace from calling
into KVM to get KVM to set the flag, and then doing something else entirely with
the task.

So if we're comfortable relying on the 1 second timeout to guard against a
misbehaving userspace, IMO we might as well fully rely on that guardrail.  I.e.
add a generic PF_xxx flag (or whatever flag location is most appropriate) to let
userspace communicate to the kernel that it's a real-time task that spends the
overwhelming majority of its time in userspace or guest context, i.e. should be
given extra leniency with respect to rcuc if the task happens to be interrupted
while it's in kernel context.