On Fri, May 10, 2024, Breno Leitao wrote:
> > IMO, reworking it to be like this is more straightforward:
> >
> >         int nr_vcpus, start, i, idx, yielded;
> >         struct kvm *kvm = me->kvm;
> >         struct kvm_vcpu *vcpu;
> >         int try = 3;
> >
> >         nr_vcpus = atomic_read(&kvm->online_vcpus);
> >         if (nr_vcpus < 2)
> >                 return;
> >
> >         /* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> >         smp_rmb();
>
> Why do you need this now? Isn't the RCU read lock in xa_load() enough?

No.  RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
rcu_read_lock() may be a literal nop.  There may be a _compiler_ barrier, but
smp_rmb() requires more than a compiler barrier on many architectures.

And just as importantly, KVM shouldn't be relying on the inner details of other
code without a hard guarantee of that behavior.  E.g. KVM does rely on
srcu_read_unlock() to provide a full memory barrier, but KVM does so through an
"official" API, smp_mb__after_srcu_read_unlock().

> >         kvm_vcpu_set_in_spin_loop(me, true);
> >
> >         start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> >         for (i = 0; i < nr_vcpus; i++) {
>
> Why do you need to started at the last boosted vcpu? I.e, why not
> starting at 0 and skipping me->vcpu_idx and kvm->last_boosted_vcpu?

To provide round-robin style yielding in order to (hopefully) yield to the vCPU
that is holding a spinlock (or some other asset that is causing a vCPU to spin
in kernel mode).

E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted
while holding a spinlock, and all vCPUs are contending for said spinlock, then
starting at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0
yielding back to vCPU1, indefinitely.

Starting at the last boosted vCPU instead results in vCPU0 yielding to vCPU1,
vCPU1 yielding to vCPU2, and vCPU2 yielding to vCPU3, thus getting back to the
vCPU that holds the spinlock soon-ish.

I'm sure more sophisticated/performant approaches are possible, but they would
likely be more complex, require persistent state (a.k.a. memory usage), and/or
need knowledge of the workload being run.
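
To make the 4-vCPU example above concrete, here is a rough userspace sketch of
the two scan orders.  This is not KVM code: pick_target(), simulate(),
NR_VCPUS, LOCK_OWNER and MAX_YIELDS are hypothetical names made up for
illustration, and the model assumes that every other vCPU is always an eligible
yield target, which the real directed-yield path does not assume.

```c
#include <assert.h>

#define NR_VCPUS   4
#define LOCK_OWNER 3   /* assumption: vCPU3 was preempted holding the lock */
#define MAX_YIELDS 16  /* stop simulating after this many yields */

/* Return the first vCPU other than 'me', scanning from 'start' and wrapping. */
static int pick_target(int me, int start)
{
	for (int i = 0; i < NR_VCPUS; i++) {
		int idx = (start + i) % NR_VCPUS;

		if (idx != me)
			return idx;
	}
	return -1;	/* only reachable if there is a single vCPU */
}

/*
 * Model one physical CPU: the running vCPU spins on the lock and yields to
 * a target, which then becomes the running vCPU.  Returns the number of
 * yields before LOCK_OWNER gets to run, or -1 if that never happens within
 * MAX_YIELDS.
 */
static int simulate(int round_robin)
{
	int last_boosted = 0;	/* toy stand-in for kvm->last_boosted_vcpu */
	int running = 0;	/* vCPU0 is the first to spin */

	for (int yields = 0; yields <= MAX_YIELDS; yields++) {
		if (running == LOCK_OWNER)
			return yields;

		/* Either resume after the last boosted vCPU, or restart at 0. */
		int start = round_robin ? last_boosted + 1 : 0;
		int target = pick_target(running, start);

		assert(target >= 0);
		last_boosted = target;
		running = target;
	}
	return -1;
}
```

In this toy model simulate(1) returns 3 (vCPU0 -> vCPU1 -> vCPU2 -> vCPU3),
whereas simulate(0) returns -1 because vCPU0 and vCPU1 just keep yielding to
each other and vCPU3 never runs.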