On Fri, May 03, 2024 at 02:29:57PM -0700, Sean Christopherson wrote:
> On Fri, May 03, 2024, Leonardo Bras wrote:
> > > KVM can provide that information with much better precision, e.g. KVM
> > > knows when it's in the core vCPU run loop.
> >
> > That would not be enough.
> > I need to present the application/problem to make a point:
> >
> > - There are multiple isolated physical CPUs (nohz_full) on which we
> >   want to run KVM_RT vcpus, which will be running a real-time (low
> >   latency) task.
> > - This task should not miss deadlines (RT), so we test the VM to make
> >   sure the maximum latency on a long run does not exceed the latency
> >   requirement.
> > - This vcpu will run with SCHED_FIFO, but has to run at a lower
> >   priority than rcuc, so we can avoid stalling other cpus.
> > - There may be some scenarios where the vcpu will go back to userspace
> >   (from the KVM_RUN ioctl), and that does not mean it's good to
> >   interrupt it to run other stuff (like rcuc).
> >
> > Now, I understand it will cover most of our issues if we have context
> > tracking around the vcpu_run loop, since we can use that to decide not
> > to run rcuc on the cpu if the interruption happened inside the loop.
> >
> > But IIUC we can have a thread that "just got out of the loop" getting
> > interrupted by the timer, and asked to run rcu_core, which will be bad
> > for latency.
> >
> > I understand that the chance may be statistically low, but happening
> > once may be enough to crush the latency numbers.
> >
> > Now, I can't think of a place to put this context tracking in kvm code
> > that would avoid the chance of rcuc running improperly; that's why I
> > suggested the timeout, even though it's ugly.
> >
> > About the false positives, IIUC we could reduce them if we reset the
> > per-cpu last_guest_exit on kvm_put.
>
> Which then opens up the window that you're trying to avoid (IRQ arriving
> just after the vCPU is put, before the CPU exits to userspace).
>
> If you want the "entry to guest is imminent" status to be preserved
> across an exit to userspace, then it seems like the flag really should
> be a property of the task, not a property of the physical CPU. Similar
> to how rcu_is_cpu_rrupt_from_idle() detects that an idle task was
> interrupted, the goal is to detect if a vCPU task was interrupted.
>
> PF_VCPU is already "taken" for similar tracking, but if we want to track
> "this task will soon enter an extended quiescent state", I don't see any
> reason to make it specific to vCPU tasks. Unless the kernel/KVM
> dynamically manages the flag, which as above will create windows for
> false negatives, the kernel needs to trust userspace to a certain extent
> no matter what. E.g. even if KVM sets a PF_xxx flag on the first
> KVM_RUN, nothing would prevent userspace from calling into KVM to get
> KVM to set the flag, and then doing something else entirely with the
> task.
>
> So if we're comfortable relying on the 1 second timeout to guard against
> a misbehaving userspace, IMO we might as well fully rely on that
> guardrail. I.e. add a generic PF_xxx flag (or whatever flag location is
> most appropriate) to let userspace communicate to the kernel that it's a
> real-time task that spends the overwhelming majority of its time in
> userspace or guest context, i.e. should be given extra leniency with
> respect to rcuc if the task happens to be interrupted while it's in
> kernel context.

I think I understand what you propose here.
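Something along these lines, I suppose (a rough sketch only, to check my
understanding; PF_USER_RT and within_rcuc_patience() are made-up names,
the latter standing in for the 1 second timeout guardrail):

	/*
	 * Sketch, not a real patch: called from rcu_pending() before
	 * deciding to raise rcuc on this cpu.
	 */
	static bool rcu_pending_task_is_lenient(void)
	{
		/*
		 * Userspace declared this task to be real-time and to
		 * spend nearly all of its time in userspace or guest
		 * context.
		 */
		if (!(current->flags & PF_USER_RT))
			return false;

		/*
		 * Don't trust the flag unconditionally: once a grace
		 * period has been waiting longer than the patience
		 * window (~1s), run rcuc anyway so a misbehaving
		 * userspace can't stall RCU forever.
		 */
		return within_rcuc_patience();
	}
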
But I am not sure what would happen in this case:
- RT guest task calls a short HLT
- Host schedules another kernel thread (other task)
- Timer interrupt: rcu_pending() checks the current task, which does not
  have the above flag set
- rcuc runs, introducing latency
- Goes back to the previous kernel thread, which finishes running with
  the added rcuc latency
- Goes back to the vcpu thread

Isn't there any chance that, on a short guest HLT, the latency previously
introduced by rcuc preempting another kernel thread ends up adding
latency to the RT task running in the vcpu?

Thanks!
Leo