Re: [PATCH] KVM: x86: update %rip after emulating IO

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Fri, 15 Mar 2019 19:05:09 +0100

On 13/03/19 17:50, Jim Mattson wrote:
> On Mon, Mar 11, 2019 at 8:01 PM Sean Christopherson
> <sean.j.christopherson@xxxxxxxxx> wrote:
>>
>> Most (all?) x86 platforms provide a port IO based reset mechanism, e.g.
>> OUT 92h or CF9h.  Userspace may emulate said mechanism, i.e. reset a
>> vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM
>> that it is doing a reset, e.g. Qemu jams vCPU state and resumes running.
>>
>> To avoid corruping %rip after such a reset, commit 0967b7bf1c22 ("KVM:
>> Skip pio instruction when it is emulated, not executed") changed the
>> behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the
>> instruction prior to exiting to userspace.  Full emulation doesn't need
>> such tricks becase re-emulating the instruction will naturally handle
>> %rip being changed to point at the reset vector.
>>
>> Updating %rip prior to executing to userspace has several drawbacks:
>>
>>   - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation
>>     fails it will likely yell about the wrong address.
>>   - Single step exits to userspace for are effectively dropped as
>>     KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO.
>>   - Behavior of PIO emulation is different depending on whether it
>>     goes down the fast path or the slow path.
>>
>> Rather than skip the PIO instruction before exiting to userspace,
>> snapshot the linear %rip and cancel PIO completion if the current
>> value does not match the snapshot.  For a 64-bit vCPU, i.e. the most
>> common scenario, the snapshot and comparison has negligible overhead
>> as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra
>> VMREAD in this case.
>>
>> All other alternatives to snapshotting the linear %rip that don't
>> rely on an explicit reset announcenment suffer from one corner case
>> or another.  For example, canceling PIO completion on any write to
>> %rip fails if userspace does a save/restore of %rip, and attempting to
>> avoid that issue by canceling PIO only if %rip changed then fails if PIO
>> collides with the reset %rip.  Attempting to zero in on the exact reset
>> vector won't work for APs, which means adding more hooks such as the
>> vCPU's MP_STATE, and so on and so forth.
>>
>> Checking for a linear %rip match technically suffers from corner cases,
>> e.g. userspace could theoretically rewrite the underlying code page and
>> expect a different instruction to execute, or the guest hardcodes a PIO
>> reset at 0xfffffff0, but those are far, far outside of what can be
>> considered normal operation.
>>
>> Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O")
>> Cc: <stable@xxxxxxxxxxxxxxx>
>> Reported-by: Jim Mattson <jmattson@xxxxxxxxxx>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
>> ---
>>
>> Although technically the "buggy" behavior goes back 10+ years, I used
>> the recent VMX change for Fixes since that was the commit that actually
>> led to a complaint.  Arguably the commit that re-introduced fast IN for
>> SVM (8370c3d08bd9 "kvm: svm: Add kvm_fast_pio_in support") should be
>> blamed, but given that his is more along the lines of "that's weird" as
>> opposed to "the world is burning", err on the side of cuation.
>>
>> That being said, odds are good that userspace won't even exercise the
>> rip checks.  Qemu has intentionally re-entered KVM to complete I/O since
>> commit 9ccfac9ea4 ("kvm: Unconditionally reenter kernel after IO exits")
>> in early 2011, i.e. testing this required modifying Qemu to not re-enter
>> the kernel.  And AFIAK no other userspace emulates port-based resets.
> 
> Does this break tools/testing/selftests/kvm/x86_64/state_test.c? IIRC,
> that test assumes that the %rip is advanced early, since it performs a
> migration with PIO completion pending. (Note that the documentation
> seems to imply that this migration may be ill-advised, since the guest
> state isn't consistent at this point.)

Yes, the test is broken.  It should do an entry with immediate exit = 1
before attempting migration.

Paolo