On 12/10/21 23:29, Sean Christopherson wrote:
Inject a #UD if patching in the correct hypercall fails, e.g. due to emulator_write_emulated() failing because RIP is mapped not-writable by the guest. The guest is likely doomed in any case, but observing a #UD in the guest is far friendlier to debug/triage than a !WRITABLE #PF with CR2 pointing at the RIP of the faulting instruction. Ideally, KVM wouldn't patch at all; it's the guest's responsibility to identify and use the correct hypercall instruction (VMCALL vs. VMMCALL). Sadly, older Linux kernels prior to commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only") do the wrong thing and blindly use VMCALL, i.e. removing the patching would break running VMs with older kernels. One could argue that KVM should be "fixed" to ignore guest paging protections instead of injecting #UD, but patching in the first place was a mistake as it was a hack-a-fix for a guest bug.
Sort of. I agree that patching is awful, but I'm not sure about injecting #UD vs. just doing the hypercall; the original reason for the patching was to allow Intel<->AMD cross-vendor migration to work somewhat.
That in turn promoted Linux's ill-conceived sloppiness of just using vmcall, which lasted until commit c1118b3602c2.
There are myriad fatal issues with KVM's patching: 1. Patches using an emulated guest write, which will fail if RIP is not mapped writable. This is the issue being mitigated. 2. Doesn't ensure the write is "atomic", e.g. a hypercall that splits a page boundary will be handled as two separate writes, which means that a partial, corrupted instruction can be observed by a vCPU.
Only the third bytes differs between VMCALL and VMMCALL so that's not really a problem. (Apparently what happened is that Microsoft asked Intel to use 0xc1 like AMD, and VMware asked AMD to use 0xd9 like Intel, or something like that; and they ended up swapping opcodes. But this may be an urban legend, no matter how plausible).
The big ones are 1 and 4. Thanks, Paolo
3. Doesn't serialize other CPU cores after updating the code stream. 4. Completely fails to account for the case where KVM is emulating due to invalid guest state with unrestricted_guest=0. Patching and retrying the instruction will result in vCPU getting stuck in an infinite loop. But, the "support" _so_ awful, especially #1, that there's practically zero chance that a modern guest kernel can rely on KVM to patch the guest. So, rather than proliferate KVM's bad behavior any further than the absolute minimum needed for backwards compatibility, just try to make it suck a little less.