On Thu, Sep 09, 2021, Hou Wenlong wrote: > It is guest's resposibility to use right instruction for hypercall, > hypervisor could emulate wrong instruction instead of modifying > guest's instruction. > > Signed-off-by: Hou Wenlong <houwenlong93@xxxxxxxxxxxxxxxxx> > --- > @@ -8747,16 +8747,15 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) > } > EXPORT_SYMBOL_GPL(kvm_emulate_hypercall); > > -static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt) > +static int emulator_hypercall(struct x86_emulate_ctxt *ctxt) > { > + int ret; > struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); > - char instruction[3]; > - unsigned long rip = kvm_rip_read(vcpu); > - > - static_call(kvm_x86_patch_hypercall)(vcpu, instruction); > > - return emulator_write_emulated(ctxt, rip, instruction, 3, > - &ctxt->exception); > + ret = kvm_emulate_hypercall_noskip(vcpu); I have mixed feelings on calling out of the emulator to do all this work. One on hand, I think it's a somewhat silly, arbitrary boundary. On the other hand, reading and writing GPRs directly means the emulation context holds stale data, especially in the case where the hypercall triggers an exit to userspace. > + if (ret) > + return X86EMUL_CONTINUE; > + return X86EMUL_IO_NEEDED; Unfortunately, simply returning X86EMUL_IO_NEEDED is not sufficient to handle the KVM_HC_MAP_GPA_RANGE case. x86_emulate_instruction() doesn't directly act on X86EMUL_IO_NEEDED, because x86_emulate_insn() doesn't return X86EMUL_*, it returns EMULATION_FAILED, EMULATION_OK, etc... x86_emulate_instruction() instead looks for other signals to differentiate between exception injection, PIO, MMIO, etc... The IO_NEEDED path would also need to provide an alternative complete_userspace_io callback, otherwise complete_hypercall_exit() will attempt to skip the instruction using e.g. vmcs.VM_EXIT_INSTRUCTION_LEN and likely send the guest into the weeds. In the prior patch, having kvm_emulate_hypercall_noskip() skip the CPL check is definitely wrong, and skipping Xen/Hyper-V hypercall support is odd, though arguably correct since Xen/Hyper-V hypercalls should never hit this path. All of the above are solvable problems, but there is a non-trivial cost in doing so, especially looking ahead to TDX support, which also needs/wants to split kvm_emulate_hypercall() but in slightly different ways[*]. I 100% agree that patching the hypercall instruction is awful. There are myriad fatal issues with the current approach: 1. Patches using an emulated guest write, which will fail if RIP is not mapped writable, and even injects a #PF into the guest on failure. 2. Doesn't ensure the write is "atomic", e.g. a hypercall that splits a page boundary will be handled as two separate writes, which means that a partial, corrupted instruction can be observed by a separate vCPU. 3. Doesn't serialize other CPU cores after updating the code stream. 4. Completely fails to account for the case where KVM is emulating due to invalid guest state with unrestricted_guest=0. Patching and retrying the instruction will result in vCPU getting stuck in an infinite loop. But, the "support" _so_ awful, especially #1, that there's practically zero chance that a modern guest kernel can rely on KVM to patch the guest. E.g. patching fails on modern Linux due to kernel code being mapped NX (barring admin override). This was addressed in the Linux guest back in 2014 by commit c1118b3602c2 ("x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only"). In other words, odds are very good that only old Linux guest kernels rely on KVM's patching. Because of that, my preference is to keep the patching, do our best to make it suck a little less, and aim to completely remove the patching entirely at some point in the future. For #1, inject a #UD instead of #PF if the write fails. The guest will still likely die, but the failure signature is much friendlier to debuggers. For #2 and #3, do nothing as fixing those is non-trivial. For #4, inject a #UD if the hypercall instruction is already the "right" instruction. I.e. retroactively define KVM's ABI to be that KVM hypercalls have undefined behavior in Big Real Mode and other modes that trigger invalid guest state (since the hypercalls will still work if unrestricted_guest=1). This can't be an ABI breakage since it does not work today and can't possibly have ever worked in the past. I have mostly-complete to do the above (I ran afoul of the NX thing), I'll hopefully get them out next week. [*] https://lkml.kernel.org/r/9e1e66787c50232391e20cb4b3d1c5b249e3f910.1625186503.git.isaku.yamahata@xxxxxxxxx