Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
> is read exactly once, and by ensuring the vCPU is booted from guest mode
> if kvm_arch_vcpu_should_kick() returns true.  Fix a similar race in
> kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
> kvm_request_needs_ipi() returns true.
>
> Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
> kvm_request_needs_ipi()) means the target vCPU could get migrated (change
> vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
> reading vcpu->mode.  If that happens, the kick/IPI will be sent to the
> old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
>
> Although failing to kick the vCPU is not exactly ideal, practically
> speaking it cannot cause a functional issue unless there is also a bug in
> the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
> behavior.
>
> The purpose of sending an IPI is purely to get a vCPU into the host (or
> out of reading SPTEs) so that the vCPU can recognize a change in state,
> e.g. a KVM_REQ_* request.  If the vCPU's handling of the state change is
> required for correctness, KVM must ensure either the vCPU sees the change
> before entering the guest, or that the sender sees the vCPU as running in
> guest mode.  All architectures handle this by (a) sending the request
> before calling kvm_vcpu_kick() and (b) checking for requests _after_
> setting vcpu->mode.
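(A side note on the (a)/(b) contract above: it is the classic
store-buffering pattern.  Below is a minimal, compilable userspace sketch
of it -- illustrative only, not KVM code; the names vcpu_mode,
vcpu_requests and the *_saw_* flags are made up, and C11 seq_cst atomics
stand in for KVM's real memory barriers:

/*
 * Sketch of the "request before kick" / "mode before request check"
 * contract.  Under seq_cst ordering, at least one side must observe
 * the other, so a pending request can never be missed by both the
 * sender and the target vCPU.  Build: cc -std=c11 -pthread sb.c
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

static atomic_int vcpu_mode = OUTSIDE_GUEST_MODE;
static atomic_uint vcpu_requests;

static bool sender_saw_guest_mode;
static bool target_saw_request;

/* Sender: (a) publish the request, _then_ check the target's mode. */
static void *sender(void *arg)
{
        (void)arg;
        atomic_fetch_or(&vcpu_requests, 1u);       /* kvm_make_request() */
        sender_saw_guest_mode =
                atomic_load(&vcpu_mode) == IN_GUEST_MODE;   /* kick? */
        return NULL;
}

/* Target: (b) set guest mode, _then_ re-check for pending requests. */
static void *target(void *arg)
{
        (void)arg;
        atomic_store(&vcpu_mode, IN_GUEST_MODE);
        target_saw_request = atomic_load(&vcpu_requests) != 0;
        return NULL;
}

int main(void)
{
        pthread_t s, t;

        pthread_create(&t, NULL, target, NULL);
        pthread_create(&s, NULL, sender, NULL);
        pthread_join(t, NULL);
        pthread_join(s, NULL);

        /* At least one side observes the other (possibly both do). */
        assert(sender_saw_guest_mode || target_saw_request);
        puts("ordering contract held");
        return 0;
}

Anything weaker than seq_cst on those four accesses would re-allow the
outcome where neither flag is set, i.e. exactly the lost-kick window the
contract closes.)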
> x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
> ensure it kicks and waits for vCPUs that started reading SPTEs _before_
> MMU changes were finalized, but any vCPU that starts reading after MMU
> changes were finalized will see the new state and can continue on
> uninterrupted.
>
> For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
> x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
> upon for functional correctness, e.g. in the dirty log case, userspace
> cannot assume it has a 100% complete log if vCPUs are still running.
>
> All that said, eliminate the benign race since the cost of doing so is an
> "extra" atomic cmpxchg() in the case where the target vCPU is loaded by
> the current pCPU or is not loaded at all.  I.e. the kick will be skipped
> due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
> opposed to the kick being skipped because of the cpu checks.
>
> Keep the "cpu != me" checks even though they appear useless/impossible at
> first glance.  x86 processes guest IPI writes in a fast path that runs in
> IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE.  And
> calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST_MODE
> or READING_SHADOW_PAGE_TABLES is perfectly reasonable.
>
> Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
> persists, e.g. the vCPU could exit guest mode and get offlined between
> the cpu_online() check and the call to smp_send_reschedule().  But, the
> online check appears to exist only to avoid a WARN in x86's
> native_smp_send_reschedule() that fires if the target CPU is not online.
> The reschedule WARN exists because CPU offlining takes the CPU out of the
> scheduling pool, i.e. the WARN is intended to detect the case where the
> kernel attempts to schedule a task on an offline CPU.  The actual sending
> of the IPI is a non-issue, as at worst it will simply be dropped on the
> floor.  In other words, KVM's usurping of the reschedule IPI could
> theoretically trigger a WARN if the stars align, but there will be no
> loss of functionality.
>
> [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
>
> Cc: Venkatesh Srinivas <venkateshs@xxxxxxxxxx>
> Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++--------
>  1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3e67c93ca403..786b914db98f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -273,14 +273,26 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
>                          continue;
>
>                  kvm_make_request(req, vcpu);
> -               cpu = vcpu->cpu;
>
>                  if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
>                          continue;
>
> -               if (tmp != NULL && cpu != -1 && cpu != me &&
> -                   kvm_request_needs_ipi(vcpu, req))
> -                       __cpumask_set_cpu(cpu, tmp);
> +               /*
> +                * Note, the vCPU could get migrated to a different pCPU at any
> +                * point after kvm_request_needs_ipi(), which could result in
> +                * sending an IPI to the previous pCPU.  But, that's ok because

"OK" (unless there's a risk someone will think of Oklahoma and take it as
an offense) :-)

> +                * the purpose of the IPI is to ensure the vCPU returns to
> +                * OUTSIDE_GUEST_MODE, which is satisfied if the vCPU migrates.
> +                * Entering READING_SHADOW_PAGE_TABLES after this point is also
> +                * ok, as the requirement is only that KVM wait for vCPUs that

"OK"

> +                * were reading SPTEs _before_ any changes were finalized.  See
> +                * kvm_vcpu_kick() for more details on handling requests.
> +                */
> +               if (tmp != NULL && kvm_request_needs_ipi(vcpu, req)) {
> +                       cpu = READ_ONCE(vcpu->cpu);
> +                       if (cpu != -1 && cpu != me)
> +                               __cpumask_set_cpu(cpu, tmp);
> +               }
>          }
>
>          called = kvm_kick_many_cpus(tmp, !!(req & KVM_REQUEST_WAIT));
> @@ -3309,16 +3321,24 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_wake_up);
>   */
>  void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
>  {
> -       int me;
> -       int cpu = vcpu->cpu;
> +       int me, cpu;
>
>          if (kvm_vcpu_wake_up(vcpu))
>                  return;
>
> +       /*
> +        * Note, the vCPU could get migrated to a different pCPU at any point
> +        * after kvm_arch_vcpu_should_kick(), which could result in sending an
> +        * IPI to the previous pCPU.  But, that's ok because the purpose of the

"OK"

> +        * IPI is to force the vCPU to leave IN_GUEST_MODE, and migrating the
> +        * vCPU also requires it to leave IN_GUEST_MODE.
> +        */
>          me = get_cpu();
> -       if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
> -               if (kvm_arch_vcpu_should_kick(vcpu))
> +       if (kvm_arch_vcpu_should_kick(vcpu)) {
> +               cpu = READ_ONCE(vcpu->cpu);
> +               if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))

It seems it was Marcelo who wrote "(unsigned)cpu < nr_cpu_ids" back in
2009, but isn't it the same as "cpu != -1", or are there other possible
negative values?  I don't think vcpu->cpu can go above nr_cpu_ids, but
maybe it can?

>                          smp_send_reschedule(cpu);
> +       }
>          put_cpu();
>  }
>  EXPORT_SYMBOL_GPL(kvm_vcpu_kick);

Reviewed-by: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>

--
Vitaly