On Thu, Jun 20, 2024 at 10:26:38AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 20, 2024 at 04:03:41AM -0300, Leonardo Bras wrote:
> > On Wed, May 15, 2024 at 07:57:41AM -0700, Paul E. McKenney wrote:
> > > On Wed, May 15, 2024 at 01:45:33AM -0300, Leonardo Bras wrote:
> > > > On Tue, May 14, 2024 at 03:54:16PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, May 13, 2024 at 06:47:13PM -0300, Leonardo Bras Soares Passos wrote:
> > > > > > On Mon, May 13, 2024 at 4:40 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, May 10, 2024, Leonardo Bras wrote:
> > > > > > > > As of today, KVM notes a quiescent state only in guest entry, which is good
> > > > > > > > as it avoids the guest being interrupted for current RCU operations.
> > > > > > > >
> > > > > > > > While the guest vcpu runs, it can be interrupted by a timer IRQ that will
> > > > > > > > check for any RCU operations waiting for this CPU. If there are any, it
> > > > > > > > invokes rcu_core() in order to sched out the current thread and note a
> > > > > > > > quiescent state.
> > > > > > > >
> > > > > > > > This occasional scheduler work will introduce tens of microseconds of
> > > > > > > > latency, which is really bad for vcpus running latency-sensitive
> > > > > > > > applications, such as real-time workloads.
> > > > > > > >
> > > > > > > > So, note a quiescent state in guest exit, so the interrupted guest is able
> > > > > > > > to deal with any pending RCU operations before being required to invoke
> > > > > > > > rcu_core(), and thus avoid the overhead of the related scheduler work.
> > > > > > >
> > > > > > > Are there any downsides to this? E.g. extra latency or anything? KVM will note
> > > > > > > a context switch on the next VM-Enter, so even if there is extra latency or
> > > > > > > something, KVM will eventually take the hit in the common case no matter what.
> > > > > > > But I know some setups are sensitive to handling select VM-Exits as soon as possible.
> > > > > > >
> > > > > > > I ask mainly because it seems like a no-brainer to me to have both VM-Entry and
> > > > > > > VM-Exit note the context switch, which begs the question of why KVM isn't already
> > > > > > > doing that. I assume it was just an oversight when commit 126a6a542446 ("kvm,rcu,nohz:
> > > > > > > use RCU extended quiescent state when running KVM guest") handled the VM-Entry
> > > > > > > case?
> > > > > >
> > > > > > I don't know; from the lore archive I see it happening in guest entry since the
> > > > > > first time it was introduced at
> > > > > > https://lore.kernel.org/all/1423167832-17609-5-git-send-email-riel@xxxxxxxxxx/
> > > > > >
> > > > > > Noting a quiescent state is cheap, but it may cost a few accesses to
> > > > > > possibly non-local cachelines. (Not an expert in this, Paul please let
> > > > > > me know if I got it wrong.)
> > > > >
> > > > > Yes, it is cheap, especially if interrupts are already disabled.
> > > > > (As in, the scheduler asks RCU to do the same amount of work on its
> > > > > context-switch fastpath.)
> > > >
> > > > Thanks!
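[ For context, a rough sketch of the idea under discussion -- the
  example_vcpu_enter()/example_vcpu_exit() names are hypothetical stand-ins
  for the arch vcpu run paths, not the actual patch;
  rcu_virt_note_context_switch() is the existing helper KVM already uses on
  the entry side: ]

#include <linux/rcupdate.h>

static void example_vcpu_enter(void)
{
	/* Existing behavior: note a quiescent state before entering the guest. */
	rcu_virt_note_context_switch();

	/* ... arch-specific VM-Enter ... */
}

static void example_vcpu_exit(void)
{
	/* ... arch-specific VM-Exit ... */

	/*
	 * Proposed addition: note a quiescent state here too, so a grace
	 * period requested while the guest was running can be satisfied
	 * without rcu_core()/rcuc having to run on this CPU from a later
	 * timer tick.
	 */
	rcu_virt_note_context_switch();
}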
> > > > > >
> > > > > > I don't have historic context on why it was implemented only on
> > > > > > guest_entry, but it would make sense, when we don't worry about latency,
> > > > > > to take the entry-only approach:
> > > > > > - It saves the overhead of calling rcu_virt_note_context_switch()
> > > > > >   twice per guest entry in the loop.
> > > > > > - KVM will probably run guest entry soon after guest exit (in the loop),
> > > > > >   so there is no need to run it twice.
> > > > > > - Eventually running rcu_core() may be cheaper than noting a quiescent
> > > > > >   state every guest entry/exit cycle.
> > > > > >
> > > > > > Upsides of the new strategy:
> > > > > > - Noting a quiescent state in guest exit avoids calling rcu_core() if
> > > > > >   there was a grace period request while the guest was running and a timer
> > > > > >   interrupt hits the cpu.
> > > > > > - If the loop re-enters quickly, there is a high chance that guest
> > > > > >   entry's rcu_virt_note_context_switch() will be fast (local cacheline),
> > > > > >   as there is a low probability of a grace period request happening
> > > > > >   between exit & re-entry.
> > > > > > - It allows us to use the RCU patience strategy to avoid rcu_core()
> > > > > >   running if any grace period request happens between guest exit and
> > > > > >   guest re-entry, which is very important for low-latency workloads
> > > > > >   running on guests as it reduces maximum latency in long runs.
> > > > > >
> > > > > > What do you think?
> > > > >
> > > > > Try both on the workload of interest with appropriate tracing and
> > > > > see what happens? The hardware's opinion overrides mine. ;-)
> > > >
> > > > That's a great approach!
> > > >
> > > > But in this case I think noting a quiescent state in guest exit is
> > > > necessary to avoid a scenario in which a VM runs for longer than the RCU
> > > > patience, and it ends up running rcuc on a nohz_full cpu, even if the
> > > > guest exit was quite brief.
> > > >
> > > > IIUC Sean's question is more along the lines of "Why does KVM not note a
> > > > quiescent state in guest exit already, if it does in guest entry", and I
> > > > just came up with a few arguments to try to find a possible rationale, since
> > > > I could find no discussion on that topic in the lore archive for the original
> > > > commit.
> > >
> > > Understood, and maybe trying it would answer that question quickly.
> > > Don't get me wrong, just because it appears to work in a few tests doesn't
> > > mean that it really works, but if it visibly blows up, that answers the
> > > question quite quickly and easily. ;-)
> > >
> > > But yes, if it appears to work, there must be a full investigation into
> > > whether or not the change really is safe.
> > >
> > > Thanx, Paul
> >
> > Hello Paul, Sean, sorry for the delay on this.
> >
> > I tested x86 by counting cycles (using rdtsc_ordered()).
> >
> > Cycles were counted upon function entry/exit in
> > {svm,vmx}_vcpu_enter_exit(), and right before / after
> > __{svm,vmx}_vcpu_run() in the same function.
> >
> > The main idea was to get the cycles spent in the procedures before entering
> > the guest (such as reporting an RCU quiescent state in entry / exit) and the
> > cycles actually used by the VM.
> >
> > Those cycles were summed up and stored in per-cpu structures, with a
> > counter to get the average value. I then created a debug file to read the
> > results and reset the counters.
> >
> > As for the VM, it got 20 vcpus, 8GB memory, and was booted with idle=poll.
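[ Roughly the shape of the instrumentation described above -- structure and
  function names here are made up for illustration, not the actual
  measurement patch: ]

#include <linux/percpu.h>
#include <asm/msr.h>	/* rdtsc_ordered() */

struct enter_exit_stats {
	u64 setup_cycles;	/* cycles around __{svm,vmx}_vcpu_run() */
	u64 vm_cycles;		/* cycles inside __{svm,vmx}_vcpu_run() */
	u64 entries;		/* VM entry count, for averaging */
};
static DEFINE_PER_CPU(struct enter_exit_stats, enter_exit_stats);

static void example_vcpu_enter_exit(void)
{
	struct enter_exit_stats *s = this_cpu_ptr(&enter_exit_stats);
	u64 t0, t1, t2, t3;

	t0 = rdtsc_ordered();
	/* entry-side setup, e.g. noting the RCU quiescent state */
	t1 = rdtsc_ordered();
	/* __svm_vcpu_run() / __vmx_vcpu_run() would go here */
	t2 = rdtsc_ordered();
	/* exit-side handling, e.g. the proposed quiescent state note */
	t3 = rdtsc_ordered();

	s->setup_cycles += (t1 - t0) + (t3 - t2);
	s->vm_cycles += t2 - t1;
	s->entries++;
	/* A debug file (not shown) would read and reset these counters. */
}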
> >
> > The workload inside the VM consisted of cyclictest on 16 vcpus
> > (SCHED_FIFO, p95), while keeping its main routine on 4 other cpus
> > (SCHED_OTHER). This was done to somewhat simulate busy and idler cpus.
> >
> > $ cyclictest -m -q -p95 --policy=fifo -D 1h -h60 -t 16 -a 4-19 -i 200
> >   --mainaffinity 0-3
> >
> > All tests were run for exactly 1 hour, and the clock counter was reset at
> > the same moment cyclictest started. After that the VM was powered off from
> > the guest. Results show the average for all CPUs in the same category,
> > in cycles.
> >
> > With the above setup, I tested 2 use cases:
> > 1 - Non-RT host, no CPU isolation, no RCU patience (regular use case)
> > 2 - PREEMPT_RT host, with CPU isolation for all vcpus (pinned), and
> >     RCU patience = 1000ms (best case for RT)
> >
> > Results are:
> > # Test case 1:
> > Vanilla: (average over all vcpus)
> > VM Cycles / RT vcpu:        123287.75
> > VM Cycles / non-RT vcpu:    709847.25
> > Setup Cycles:               186.00
> > VM entries / RT vcpu:       58737094.81
> > VM entries / non-RT vcpu:   10527869.25
> > Total cycles in RT VM:      7241564260969.80
> > Total cycles in non-RT VM:  7473179035472.06
> >
> > Patched: (average over all vcpus)
> > VM Cycles / RT vcpu:        124695.31 (+ 1.14%)
> > VM Cycles / non-RT vcpu:    710479.00 (+ 0.09%)
> > Setup Cycles:               218.65 (+17.55%)
> > VM entries / RT vcpu:       60654285.44 (+ 3.26%)
> > VM entries / non-RT vcpu:   11003516.75 (+ 4.52%)
> > Total cycles in RT VM:      7563305077093.26 (+ 4.44%)
> > Total cycles in non-RT VM:  7817767577023.25 (+ 4.61%)
> >
> > Discussion:
> > Setup cycles rose by ~33 cycles, increasing overhead.
> > This shows that noting a quiescent state in guest exit introduces setup
> > routine costs, which is expected.
> >
> > On the other hand, both the average time spent inside the VM and the number
> > of VM entries rose, giving the VM ~4.5% more cpu cycles available to run,
> > which is positive. The extra cycles probably came from not having
> > invoke_rcu_core() run after VM exit.
> >
> >
> > # Test case 2:
> > Vanilla: (average over all vcpus)
> > VM Cycles / RT vcpu:        123785.63
> > VM Cycles / non-RT vcpu:    698758.25
> > Setup Cycles:               187.20
> > VM entries / RT vcpu:       61096820.75
> > VM entries / non-RT vcpu:   11191873.00
> > Total cycles in RT VM:      7562908142051.72
> > Total cycles in non-RT VM:  7820413591702.25
> >
> > Patched: (average over all vcpus)
> > VM Cycles / RT vcpu:        123137.13 (- 0.52%)
> > VM Cycles / non-RT vcpu:    696824.25 (- 0.28%)
> > Setup Cycles:               229.35 (+22.52%)
> > VM entries / RT vcpu:       61424897.13 (+ 0.54%)
> > VM entries / non-RT vcpu:   11237660.50 (+ 0.41%)
> > Total cycles in RT VM:      7563685235393.27 (+ 0.01%)
> > Total cycles in non-RT VM:  7830674349667.13 (+ 0.13%)
> >
> > Discussion:
> > Setup cycles rose by ~42 cycles, increasing overhead.
> > This shows that noting a quiescent state in guest exit introduces setup
> > routine costs, which is expected.
> >
> > The average time spent inside the VM was reduced, but the number of VM
> > entries rose, leaving the VM with around the same number of cpu cycles
> > available to run. This means the overhead of reporting an RCU quiescent
> > state on VM exit got absorbed, and it may have to do with those rare
> > invoke_rcu_core()s that were hurting latency.
> >
> > The difference is much smaller compared to case 1, and this is probably
> > because there is a clause in rcu_pending() for isolated (nohz_full) cpus
> > which may already be inhibiting a lot of invoke_rcu_core()s.
> >
> > Sean, Paul, what do you think?
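[ For reference, the rcu_pending() clause referred to above is roughly the
  nohz_full check in kernel/rcu/tree.c; this is a paraphrase from memory,
  not an exact quote of the current code: ]

static int rcu_pending(int user)
{
	/* ... earlier checks elided ... */

	/* Is this a nohz_full CPU in userspace or idle? (Ignore RCU if so.) */
	if ((user || rcu_is_cpu_rrupt_from_idle()) && rcu_nohz_full_cpu())
		return 0;

	/*
	 * ... remaining checks elided; returning 1 here is what makes the
	 * tick path (rcu_sched_clock_irq()) call invoke_rcu_core().
	 */
	return 1;
}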
>
> First, thank you for doing this work!

Thank you!

> I must defer to Sean on how to handle this tradeoff. My kneejerk reaction
> would be to add yet another knob, but knobs are not free.

Sure, let's wait for Sean's feedback. But I am very positive about the results,
as we have latency improvements for RT and, IIUC, performance improvements for
non-RT.

While we wait, I ran stress-ng in a VM with a non-RT kernel (rcu.patience=0),
and could see the instruction count rising by around 1.5% after applying the
patch. Maybe I am missing something, but I thought we could get the ~4.5%
improvement seen above.

Thanks!
Leo

>
> Thanx, Paul
>