On Thu, 2024-07-25 at 08:59 -0400, Maxim Levitsky wrote:
> On Tue, 2024-07-16 at 15:36 -0700, Sean Christopherson wrote:
> > On Mon, Jul 15, 2024, Maxim Levitsky wrote:
> > > VMX code uses the segment cache to avoid reading guest segment fields.
> > > 
> > > The cache is reset each time a segment field (e.g. base/access rights/etc.)
> > > is written: first the cache is invalidated, then the new value of the
> > > field is written.
> > > 
> > > However, if the vCPU is preempted between these two events, and this
> > > segment field is read (e.g. KVM reads SS's access rights to check
> > > whether the vCPU is in kernel mode), then the old field value will get
> > > cached and never updated.
> > 
> > It'd be super helpful to include the gory details about how kvm_arch_vcpu_put()
> > reads stale data.  Without that information, it's very hard to figure out how
> > getting preempted is problematic.
> 
> I will do this in the next version of this patch:
> 
> vmx_vcpu_reset() resets the segment cache bitmask and then initializes
> the segments in the VMCS.  However, if the vCPU is preempted in the
> middle of this code, kvm_arch_vcpu_put() is called, which reads SS's
> AR bytes to determine whether the vCPU is in kernel mode, and that
> caches the old value.
> 
> > > Usually a lock is required to avoid such a race, but since vCPU segments
> > > are only accessed by its vCPU thread, we can avoid a lock and
> > > only disable preemption in places where the segment cache
> > > is invalidated and segment fields are updated.
> > 
> > This doesn't fully fix the problem.  It's not just kvm_sched_out() => kvm_arch_vcpu_put()
> > that's problematic, it's any path that executes KVM code in interrupt context.
> > And it's not limited to segment registers; any register that is conditionally
> > cached via arch.regs_avail is susceptible to races.
> > 
> > Specifically, kvm_guest_state() and kvm_guest_get_ip() will read SS.AR_bytes and
> > RIP in NMI and/or IRQ context when handling a PMI.  A few possible ideas:
> > 
> > 1. Force reads from IRQ/NMI context to skip the cache and go to the VMCS.
> 
> This IMHO is the best solution.  For the segment cache it's easy to do; the
> code will be contained in the vmx_read_guest_seg_* functions.
> 
> For other VMX registers, this can be a lot of work due to the way the code
> is scattered around.  Still probably doable.
> 
> > 2. Same thing as #1, but focus it specifically on kvm_arch_vcpu_in_kernel()
> > and kvm_arch_vcpu_get_ip(), and WARN if kvm_register_is_available() or
> > vmx_segment_cache_test_set() is invoked from IRQ or NMI context.
> 
> I agree on this; it is actually one of the suggestions I had originally
> (I didn't notice kvm_arch_vcpu_get_ip() though).
> 
> I think I will implement this suggestion.
> 
> > 3. Force caching of SS.AR_bytes, CS.AR_bytes, and RIP prior to kvm_after_interrupt(),
> > rename preempted_in_kernel to something like "exited_in_kernel" and snapshot
> > it before kvm_after_interrupt(), and add the same hardening as #2.
> > 
> > This is doable because kvm_guest_state() should never read guest state for
> > PMIs that occur between VM-Exit and kvm_after_interrupt(), nor should KVM
> > write guest state in that window.  And the intent of the "preempted in kernel"
> > check is to query vCPU state at the time of exit.
> > 
> > 5. Do a combination of #3 and patch 02 (#3 fixes PMIs, patch 02 fixes preemption).
> > 
> > My vote is probably for #2 or #4.
> 
> #4 causes a NULL pointer dereference here :)
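Since I'm going to take a stab at #2, below is roughly the shape I have in
mind, modeled on the existing vmx_segment_cache_test_set() and
vmx_read_guest_seg_ar() helpers in arch/x86/kvm/vmx/vmx.c.  This is an
untested sketch, not the actual patch; in particular, using in_task() to
detect IRQ/NMI context and bypassing the cache in the read path are just my
assumptions about how the hardening could look:

static bool vmx_segment_cache_test_set(struct vcpu_vmx *vmx, unsigned seg,
				       unsigned field)
{
	u32 mask = 1 << (seg * SEG_FIELD_NR + field);
	bool ret;

	/* Populating the per-vCPU cache outside of task context is a bug. */
	if (WARN_ON_ONCE(!in_task()))
		return false;

	if (!kvm_register_is_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS)) {
		kvm_register_mark_available(&vmx->vcpu, VCPU_EXREG_SEGMENTS);
		vmx->segment_cache.bitmask = 0;
	}
	ret = vmx->segment_cache.bitmask & mask;
	vmx->segment_cache.bitmask |= mask;
	return ret;
}

static u32 vmx_read_guest_seg_ar(struct vcpu_vmx *vmx, unsigned seg)
{
	/*
	 * In IRQ/NMI context (e.g. a PMI reading SS.AR_bytes via
	 * kvm_guest_state()), bypass the cache and read the VMCS directly.
	 */
	if (!in_task())
		return vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);

	if (!vmx_segment_cache_test_set(vmx, seg, SEG_FIELD_AR))
		vmx->segment_cache.seg[seg].ar =
			vmcs_read32(kvm_vmx_segment_fields[seg].ar_bytes);

	return vmx->segment_cache.seg[seg].ar;
}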
> > I definitely think we need WARNs in the caching
> > code, and in general kvm_arch_vcpu_put() shouldn't be reading cacheable state, i.e.
> > I am fairly confident we can restrict it to checking CPL.
> > 
> > I don't hate this patch by any means, but I don't love disabling preemption in a
> > bunch of flows just so that the preempted_in_kernel logic works.
> 
> Thanks for the suggestions!
> 
> Best regards,
> Maxim Levitsky

Hi,

I decided to keep it simple: I'll send a patch which moves the call to
vmx_segment_cache_clear() to after we are done with the segment
initialization in vmx_vcpu_reset(), and later I'll write a
refactoring/hardening patch to make sure that we don't read the cache from
interrupt context.

Best regards,
	Maxim Levitsky
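P.S. For reference, the simple fix boils down to something like the diff
below (illustrative and untested; the context lines in vmx_vcpu_reset()
are approximate):

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
-	vmx_segment_cache_clear(vmx);
-
 	seg_setup(VCPU_SREG_CS);
 	vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
 	vmcs_writel(GUEST_CS_BASE, 0xffff0000ul);
@@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	/*
+	 * Reset the segment cache only after the segment fields in the VMCS
+	 * have been written.  If the vCPU is preempted in the middle of the
+	 * init, kvm_arch_vcpu_put() reads SS.AR_bytes and repopulates the
+	 * cache with a stale value; clearing the cache last invalidates any
+	 * such entry.
+	 */
+	vmx_segment_cache_clear(vmx);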