On Wed, May 15, 2024 at 7:08 PM Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
>
> On 5/15/24 10:06, Liang Chen wrote:
> > In a nested VM environment, a vCPU can run either an L1 or L2 VM. If the
> > L0 VMM tries to configure L1 VM registers via the KVM_SET_REGS ioctl while
> > the vCPU is running an L2 VM, it may inadvertently modify the L2 VM's
> > registers, corrupting the L2 VM. To avoid this error, registers should be
> > treated as read-only when the vCPU is actively running an L2 VM.
>
> No, this is intentional. The L0 hypervisor has full control on the CPU
> registers, no matter if the VM is in guest mode or not.
>

I see. Thank you! The patch [1] will provide a convenient way for
userspace to determine if the CPU is in guest mode, which should be
sufficient for userspace to avoid mistakenly setting L2 registers.

[1] https://lore.kernel.org/kvm/20240508132502.184428-1-julian.stecklina@xxxxxxxxxxxxxxxxxxxxx/T/#u

> Looking at the syzkaller report, the first thing to do is to remove the
> threading, because vcpu ioctls are anyway serialized by the vcpu mutex.
>
> Let's assume that the first 7 operations, i.e. all except KVM_RUN and
> KVM_SET_REGS, execute in sequence.
> There is possible interleaving of the two syz_kvm_setup_cpu$x86 calls,
> but because only the second sets up nested virtualization, we can hope
> that we're lucky. [1]
>
> To do so, change the main to something like
>
> int main(int argc, char **argv)
> {
>     int i;
>
>     syscall(__NR_mmap, /*addr=*/0x1ffff000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>     syscall(__NR_mmap, /*addr=*/0x20000000ul, /*len=*/0x1000000ul,
>             /*prot=PROT_WRITE|PROT_READ|PROT_EXEC*/ 7ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>     syscall(__NR_mmap, /*addr=*/0x21000000ul, /*len=*/0x1000ul, /*prot=*/0ul,
>             /*flags=MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE*/ 0x32ul, /*fd=*/-1,
>             /*offset=*/0ul);
>
>     // r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000440), 0x0, 0x0)
>     execute_call(0);
>     // r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>     execute_call(1);
>     // r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
>     execute_call(2);
>     // mmap(&(0x7f0000000000/0x3000)=nil, 0x3000, 0x2, 0x13, r2, 0x0)
>     execute_call(3);
>     // syz_kvm_setup_cpu$x86(r1, 0xffffffffffffffff, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000200)=[@text16={0x10, 0x0}], 0x1, 0x0, 0x0, 0x0)
>     execute_call(4);
>     // ioctl$KVM_REGISTER_COALESCED_MMIO(0xffffffffffffffff, 0x4010ae67, &(0x7f0000000000)={0x2})
>     execute_call(5);
>     // syz_kvm_setup_cpu$x86(0xffffffffffffffff, r2, &(0x7f0000fe7000/0x18000)=nil, &(0x7f0000000340)=[@text64={0x40, 0x0}], 0x1, 0x50, 0x0, 0x0)
>     execute_call(6);
>     // ioctl$KVM_RUN(r2, 0xae80, 0x0) (rerun: 64)
>     for (i = 0; i < atoi(argv[1]); i++)
>         execute_call(7);
>     // ioctl$KVM_SET_REGS(r2, 0x4090ae82, &(0x7f00000000c0)={[0x7], 0x0, 0x60000})
>     // interleaved with KVM_RUN
>     execute_call(8);
>     // one more KVM_RUN to show the bug
>     execute_call(7);
>     return 0;
> }
>
> This reproduces the fact that KVM_SET_REGS can sneak in between any
> two KVM_RUN.
> You'll see that changing the argument may cause an entry with invalid
> guest state (argument is 0 or >= 3 - that's ok) or the WARN
> (argument == 1).
>
> The attached cleaned up reproducer shows that the problem is simply that
> EFLAGS.VM is set in 64-bit mode. To fix it, it should be enough to do
> a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0); just like
> a few lines below.
>

Yes, that was the situation we were trying to deal with. However, I am
not quite sure I fully understand the suggestion, "To fix it, it should
be enough to do a nested_vmx_vmexit(vcpu, EXIT_REASON_TRIPLE_FAULT, 0, 0);
just like a few lines below." From what I see,
"KVM_BUG_ON(vmx->nested.nested_run_pending, vcpu->kvm) == true" in
__vmx_handle_exit can be the result of an invalid VMCS12 from L1 that
somehow escaped checking when trapping into L0 in nested_vmx_run. It is
not easy to tell whether it resulted from the userspace register-set ops
we are discussing or from an invalid VMCS12 supplied by L1. Additionally,
nested_vmx_vmexit warns when vmx->nested.nested_run_pending is true,
saying that "trying to cancel vmlaunch/vmresume is a bug".

Thanks,
Liang

> Paolo
>
> [1] in fact the first syz_kvm_setup_cpu$x86 passes vcpufd==-1 and
> the second passes vmfd==-1. Each of them does only half of the work.
> The ioctl$KVM_REGISTER_COALESCED_MMIO is also mostly dummy because it
> passes -1 as the file descriptor, but it happens to set
> vcpu->run->request_interrupt_window! Which is important later.
> >
> > Signed-off-by: Liang Chen <liangchen.linux@xxxxxxxxx>
> > Reported-by: syzbot+988d9efcdf137bc05f66@xxxxxxxxxxxxxxxxxxxxxxxxx
> > Closes: https://lore.kernel.org/all/0000000000007a9acb06151e1670@xxxxxxxxxx
> > ---
> >   arch/x86/kvm/x86.c | 6 ++++++
> >   1 file changed, 6 insertions(+)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 91478b769af0..30f63a7dd120 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11527,6 +11527,12 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >
> >   int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
> >   {
> > +	/*
> > +	 * Prevent L0 VMM from inadvertently modifying L2 VM registers directly.
> > +	 */
> > +	if (is_guest_mode(vcpu))
> > +		return -EACCES;
> > +
> >   	vcpu_load(vcpu);
> >   	__set_regs(vcpu, regs);
> >   	vcpu_put(vcpu);