On Mon, Apr 29, 2024, David Matlack wrote:
> On Fri, Apr 26, 2024 at 2:01 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > On Thu, Mar 07, 2024, David Matlack wrote:
> > >
> > > -	if (current->on_rq) {
> > > +	if (current->on_rq && vcpu->wants_to_run) {
> > >  		WRITE_ONCE(vcpu->preempted, true);
> > >  		WRITE_ONCE(vcpu->ready, true);
> > >  	}
> >
> > Long story short, I was playing around with wants_to_run for a few hairbrained
> > ideas, and realized that there's a TOCTOU bug here.  Userspace can toggle
> > run->immediate_exit at will, e.g. can clear it after the kernel loads it to
> > compute vcpu->wants_to_run.
> >
> > That's not fatal for this use case, since userspace would only be shooting itself
> > in the foot, but it leaves a very dangerous landmine, e.g. if something else in
> > KVM keys off of vcpu->wants_to_run to detect that KVM is in its run loop, i.e.
> > relies on wants_to_run being set if KVM is in its core run loop.
> >
> > To address that, I think we should have all architectures check wants_to_run, not
> > immediate_exit.
>
> Rephrasing to make sure I understand you correctly: It's possible for
> KVM to see conflicting values of wants_to_run and immediate_exit,
> since userspace can change immediate_exit at any point. e.g. It's
> possible for KVM to see wants_to_run=true and immediate_exit=true,
> which wouldn't make sense. This wouldn't cause any bugs today, but
> could result in buggy behavior down the road so we might as well clean
> it up now.

Yep.

> > Hmm, and we should probably go a step further and actively prevent using
> > immediate_exit from the kernel, e.g.
> > rename it to something scary like:
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 2190adbe3002..9c5fe1dae744 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -196,7 +196,11 @@ struct kvm_xen_exit {
> >  struct kvm_run {
> >  	/* in */
> >  	__u8 request_interrupt_window;
> > +#ifndef __KERNEL__
> >  	__u8 immediate_exit;
> > +#else
> > +	__u8 hidden_do_not_touch;
> > +#endif
>
> This would result in:
>
>   vcpu->wants_to_run = !READ_ONCE(vcpu->run->hidden_do_not_touch);
>
> :)
>
> Of course we could pick a better name...

Heh, yeah, for demonstration purposes only.

> but isn't every field in kvm_run open to TOCTOU issues?

Yep, and we've had bugs, e.g. see commit 0d033770d43a ("KVM: x86: Fix
KVM_CAP_SYNC_REGS's sync_regs() TOCTOU issues").

> (Is immediate_exit really special enough to need this protection?)

I think so.  It's not that immediate_exit is more prone to TOCTOU bugs than other
fields in kvm_run (though I do think immediate_exit has higher potential for
future bugs), or even that the severity of bugs that could occur with
immediate_exit is high (which I again think is the case).  It's that it's actually
feasible to effectively prevent TOCTOU bugs at minimal cost (including ongoing
maintenance cost).

At the cost of a small-ish one-time change, we can protect *all* KVM code against
improper usage of immediate_exit.  Doing the same for other kvm_run fields is less
feasible, as the relevant logic is much more architecture specific.  E.g. x86 now
does a full copy of sregs and events in kvm_sync_regs, but not regs, because the
input for regs is never checked.  And blindly creating an in-kernel copy of all
state would be extremely wasteful for s390, which IIUC uses kvm_run.s.regs as
_the_ buffer for guest register state.
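
To make the "load once, then only trust the snapshot" idea concrete, here is a
rough userspace toy, not kernel code.  Everything in it is hypothetical and for
illustration only: struct fake_run stands in for the shared kvm_run page,
struct fake_vcpu for the vCPU, and the helper names are made up.  A volatile
read is used as a crude stand-in for READ_ONCE().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the userspace-writable kvm_run page. */
struct fake_run {
	volatile uint8_t immediate_exit;	/* "userspace" may flip this at any time */
};

/* Hypothetical stand-in for the vCPU; wants_to_run is kernel-private. */
struct fake_vcpu {
	struct fake_run *run;
	bool wants_to_run;
};

/* Load the shared field exactly once and cache the result. */
static void vcpu_snapshot(struct fake_vcpu *vcpu)
{
	vcpu->wants_to_run = !vcpu->run->immediate_exit;
}

/* All later checks consult the snapshot, never the shared field. */
static bool vcpu_should_enter_guest(const struct fake_vcpu *vcpu)
{
	return vcpu->wants_to_run;
}

/*
 * Simulate userspace flipping immediate_exit after the snapshot is taken.
 * Returns 1 if the snapshot still says "run" while the shared field now
 * says "exit", i.e. the two disagree but the snapshot stayed stable.
 */
static int snapshot_survives_flip(void)
{
	struct fake_run run = { .immediate_exit = 0 };
	struct fake_vcpu vcpu = { .run = &run, .wants_to_run = false };

	vcpu_snapshot(&vcpu);		/* "kernel" loads the field once */
	run.immediate_exit = 1;		/* "userspace" changes it afterwards */

	return vcpu_should_enter_guest(&vcpu) && run.immediate_exit;
}
```

Code that re-reads run->immediate_exit after the snapshot could see a value that
contradicts wants_to_run, which is the inconsistency described above.  Hiding
the field's name from in-kernel code would turn any such re-read into a build
failure instead of a latent bug.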