Re: [PATCH] KVM: Addressing a possible race in kvm_vcpu_on_spin:

Sean Christopherson <seanjc@xxxxxxxxxx> · Fri, 10 May 2024 07:39:14 -0700

On Fri, May 10, 2024, Breno Leitao wrote:
> > IMO, reworking it to be like this is more straightforward:
> > 
> > 	int nr_vcpus, start, i, idx, yielded;
> > 	struct kvm *kvm = me->kvm;
> > 	struct kvm_vcpu *vcpu;
> > 	int try = 3;
> > 
> > 	nr_vcpus = atomic_read(&kvm->online_vcpus);
> > 	if (nr_vcpus < 2)
> > 		return;
> > 
> > 	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> > 	smp_rmb();
> 
> Why do you need this now? Isn't the RCU read lock in xa_load() enough?

No.  RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
rcu_read_lock() may be a literal nop.  There may be a _compiler_ barrier, but
smp_rmb() requires more than a compiler barrier on many architectures.

And just as importantly, KVM shouldn't be relying on the inner details of other
code without a hard guarantee of that behavior.  E.g. KVM does rely on
srcu_read_unlock() to provide a full memory barrier, but KVM does so through an
"official" API, smp_mb__after_srcu_read_unlock().
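
To make the pairing concrete, here is a rough userspace sketch of the
publish/consume pattern that the smp_wmb()/smp_rmb() pair protects.  It is not
the KVM code: struct toy_vm, struct toy_vcpu, and both helpers are made-up
names, and C11 release/acquire fences stand in for the kernel barriers (they
are somewhat stronger than smp_wmb()/smp_rmb(), but show the same ordering
requirement).  It also assumes writers are serialized, which the sketch does
not enforce.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_VCPUS 8

/* Hypothetical stand-ins for struct kvm_vcpu / struct kvm, not the real layouts. */
struct toy_vcpu {
	int id;
};

struct toy_vm {
	struct toy_vcpu *vcpus[MAX_VCPUS];
	atomic_int online_vcpus;
};

/*
 * Writer side: store the vCPU pointer first, then the count that advertises
 * it.  Assumes callers are serialized (one writer at a time).
 */
int toy_publish_vcpu(struct toy_vm *vm, struct toy_vcpu *vcpu)
{
	int idx = atomic_load_explicit(&vm->online_vcpus, memory_order_relaxed);

	assert(vcpu != NULL);
	if (idx >= MAX_VCPUS)
		return -1;

	vm->vcpus[idx] = vcpu;

	/* Stand-in for smp_wmb(): pointer store is ordered before the count store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&vm->online_vcpus, idx + 1, memory_order_relaxed);
	return idx;
}

/*
 * Reader side: load the count first, then the pointers it covers.  Without
 * the fence, a weakly ordered CPU could observe the new count but a stale
 * (NULL) pointer.
 */
int toy_count_visible_vcpus(struct toy_vm *vm)
{
	int nr = atomic_load_explicit(&vm->online_vcpus, memory_order_relaxed);
	int i, seen = 0;

	/* Stand-in for smp_rmb(): count load is ordered before the pointer loads. */
	atomic_thread_fence(memory_order_acquire);

	for (i = 0; i < nr; i++) {
		if (vm->vcpus[i] != NULL)
			seen++;
	}
	return seen;
}
```

With both fences in place, a reader that observes a count of N is guaranteed to
see N non-NULL pointers, so toy_count_visible_vcpus() always returns the count
it read.  Nothing about an RCU read-side critical section provides that
ordering on its own.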

> > 	kvm_vcpu_set_in_spin_loop(me, true);
> > 
> > 	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> > 	for (i = 0; i < nr_vcpus; i++) {
> 
> Why do you need to start at the last boosted vcpu? I.e., why not
> start at 0 and skip me->vcpu_idx and kvm->last_boosted_vcpu?

To provide round-robin style yielding in order to (hopefully) yield to the vCPU
that is holding a spinlock (or some other resource that is causing a vCPU to
spin in kernel mode).

E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted while
holding a spinlock, and all vCPUs are contending for said spinlock, then starting
at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0 yielding
back to vCPU1, indefinitely.

Starting at the last boosted vCPU instead results in vCPU0 yielding to vCPU1,
vCPU1 yielding to vCPU2, and vCPU2 yielding to vCPU3, thus getting back to the
vCPU that holds the spinlock soon-ish.
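
A toy model of the two policies, purely to illustrate the 4-vCPU example above.
Everything here is hypothetical (toy_pick_target(), toy_hops_to_holder(), the
"every other vCPU is an eligible target" simplification); the real loop has
additional eligibility checks that this sketch ignores.

```c
#include <stdbool.h>

/*
 * Pick the first vCPU at or after @start (wrapping) that isn't @me.
 * Returns -1 if there is no other vCPU to yield to.
 */
int toy_pick_target(int nr_vcpus, int me, int start)
{
	int i;

	for (i = 0; i < nr_vcpus; i++) {
		int idx = (start + i) % nr_vcpus;

		if (idx == me)
			continue;
		return idx;
	}
	return -1;
}

/*
 * All vCPUs share one pCPU.  The running vCPU spins and yields to a target,
 * which then runs (and spins) in turn, until the yield lands on @holder.
 * Returns the number of yields needed, or -1 if @holder isn't reached within
 * @max_hops yields.
 */
int toy_hops_to_holder(int nr_vcpus, int first, int holder,
		       bool round_robin, int max_hops)
{
	int last_boosted = 0;	/* models kvm->last_boosted_vcpu, initially 0 */
	int running = first;
	int hops;

	for (hops = 1; hops <= max_hops; hops++) {
		int start = round_robin ? last_boosted + 1 : 0;
		int target = toy_pick_target(nr_vcpus, running, start);

		if (target < 0)
			return -1;

		last_boosted = target;
		if (target == holder)
			return hops;

		running = target;
	}
	return -1;
}
```

In this model, toy_hops_to_holder(4, 0, 3, true, 16) reaches vCPU3 on the third
yield (0 -> 1 -> 2 -> 3), whereas toy_hops_to_holder(4, 0, 3, false, 16)
ping-pongs between vCPU0 and vCPU1 and gives up with -1.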

I'm sure more sophisticated/performant approaches are possible, but they would
likely be more complex, require persistent state (a.k.a. memory usage), and/or
need knowledge of the workload being run.