On Wed, May 15, 2024 at 7:08 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 5/15/24 10:06, Liang Chen wrote:
> > In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> > L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> > the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> > registers, corrupting the L2 VM. To avoid this error, registers should be
> > treated as read-only when the vCPU is actively running an L2 VM.
>
> No, this is intentional. The L0 hypervisor has full control on the CPU
> registers, no matter if the VM is in guest mode or not.
>

I see. Thank you! The patch [1] will provide a convenient way for
userspace to determine if the CPU is in guest mode, which should be
sufficient for userspace to avoid mistakenly setting L2 registers.

[1] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@xxxxxxxxxxxxxxxxxxxxx/T/#u

> Looking at the syzkaller report, the first thing to do is to remove the
> threading, because vcpu ioctls are anyway serialized by the vcpu mutex.
>
> Let's assume that the first 7 operations, i.e. all except KVM_RUN and
> KVM_SET_REGS, execute in sequence.
> There is possible interleaving of the two syz_kvm_setup_cpu$x86 calls,
> but because only the second sets up nested virtualization, we can hope
> that we're lucky. [1]
>
> To do so, change the main to something like
>
> int main(int argc, char **argv)
> {
>     int i;
>
>     syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>     syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
>             /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>     syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>
>     // r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000440), 0x0, 0x0)
>     execute_call(0);
>     // r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>     execute_call(1);
>     // r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
>     execute_call(2);
>     // mmap(&(0x7f0000000000/0x3000)=nil, 0x3000, 0x2, 0x13, r2, 0x0)
>     execute_call(3);
>     // syz_kvm_setup_cpu$x86(r1, 0xffffffffffffffff, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000200)=[@text16={0x10, 0x0}], 0x1, 0x0, 0x0, 0x0)
>     execute_call(4);
>     // ioctl$KVM_REGISTER_COALESCED_MMIO(0xffffffffffffffff, 0x4010ae67, &(0x7f0000000000)={0x2})
>     execute_call(5);
>     // syz_kvm_setup_cpu$x86(0xffffffffffffffff, r2, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000340)=[@text64={0x40, 0x0}], 0x1, 0x50, 0x0, 0x0)
>     execute_call(6);
>     // ioctl$KVM_RUN(r2, 0xae80, 0x0) (rerun: 64)
>     for (i = 0; i < atoi(argv[1]); i++)
>         execute_call(7);
>     // ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f00000000c0)={[0x7], 0x0, 0x60000})
>     // interleaved with KVM_RUN
>     execute_call(8);
>     // one more KVM_RUN to show the bug
>     execute_call(7);
>     return 0;
> }
>
> This reproduces the fact that KVM_SET_REGS can sneak in between any
> two KVM_RUN.
> You'll see that changing the argument may cause an entry with invalid
> guest state (argument is 0 or >= 3 - that's ok) or the WARN
> (argument == 1).
>
> The attached cleaned up reproducer shows that the problem is simply that
> EFLAGS.VM is set in 64-bit mode. To fix it, it should be enough to do
> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
> a few lines below.
>

Yes, that was the situation we were trying to deal with. However, I am
not quite sure I fully understand the suggestion, "To fix it, it should
be enough to do a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
just like a few lines below." From what I see,
"KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm) == true" in
__vmx_handle_exit can be the result of an invalid VMCS12 from L1 that
somehow escaped checking when trapping into L0 in nested_vmx_run. It is
not easy to tell whether it resulted from the userspace register-set ops
we are discussing or from an invalid VMCS12 supplied by L1. Additionally,
nested_vmx_vmexit warns when vmx->nested.nested_run_pending is true,
saying that "trying to cancel vmlaunch/vmresume is a bug".

Thanks,
Liang

> Paolo
>
> [1] in fact the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
> the second passes vmfd==-1. Each of them does only half of the work.
> The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly dummy because it
> passes -1 as the file descriptor, but it happens to set
> vcpu->run->request_interrupt_window! Which is important later.
> >
> > Signed-off-by: Liang Chen <liangchen.linux@xxxxxxxxx>
> > Reported-by: syzbot+988d9efcdf137bc05f66@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@xxxxxxxxxx
> > ---
> >   arch/x86/kvm/x86.c | 6 ++++++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 91478b769af0..30f63a7dd120 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >
> >   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >   {
> > +	/*
> > +	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
> > +	 */
> > +	if (is_guest_mode(vcpu))
> > +		return -EACCES;
> > +
> >   	vcpu_load(vcpu);
> >   	__set_regs(vcpu, regs);
> >   	vcpu_put(vcpu);