On Sat, 25 Dec 2021 at 09:19, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Thu, Dec 23, 2021, Paolo Bonzini wrote:
> > On 12/22/21 16:18, David Woodhouse wrote:
> > > > > n a mapped
> > > > > shared info page.
> > > > >
> > > > > This can be used for routing MSI of passthrough devices to PIRQ event
> > > > > channels in a Xen guest, and we can build on it for delivering IPIs and
> > > > > timers directly from the kernel too.
> > > > Queued patches 1-5, thanks.
> > > >
> > > > Paolo
> > > Any word on when these will make it out to kvm.git? Did you find
> > > something else I need to fix?
> >
> > I got a WARN when testing the queue branch, now I'm bisecting.
>
> Was it perhaps this WARN?
>
> WARNING: CPU: 5 PID: 2180 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3163 mark_page_dirty_in_slot+0x6b/0x80 [kvm]
> Modules linked in: kvm_intel kvm irqbypass
> CPU: 5 PID: 2180 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #81
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:mark_page_dirty_in_slot+0x6b/0x80 [kvm]
> kvm_write_guest+0x117/0x120 [kvm]
> kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
> kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
> kvm_vm_ioctl+0x711/0xd50 [kvm]
> __x64_sys_ioctl+0x7f/0xb0
> do_syscall_64+0x38/0xc0
> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> If so, it was clearly introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty()
> is called without an active vCPU"). But that change is 100% correct as it's trivial
> to crash KVM without its protection. E.g. running hyperv_clock with these mods
>
> diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
> index 2dd78c1db4b6..ae0d0490580a 100644
> --- a/tools/testing/selftests/kvm/lib/kvm_util.c
> +++ b/tools/testing/selftests/kvm/lib/kvm_util.c
> @@ -371,6 +371,8 @@ struct kvm_vm *vm_create_with_vcpus(enum vm_guest_mode mode, uint32_t nr_vcpus,
>         vm_create_irqchip(vm);
>  #endif
>
> +       vm_enable_dirty_ring(vm, 0x10000);
> +
>         for (i = 0; i < nr_vcpus; ++i) {
>                 uint32_t vcpuid = vcpuids ? vcpuids[i] : i;
>
> diff --git a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
> index e0b2bb1339b1..1032695e3901 100644
> --- a/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
> +++ b/tools/testing/selftests/kvm/x86_64/hyperv_clock.c
> @@ -212,6 +212,8 @@ int main(void)
>         vm = vm_create_default(VCPU_ID, 0, guest_main);
>         run = vcpu_state(vm, VCPU_ID);
>
> +       vm_mem_region_set_flags(vm, 0, KVM_MEM_LOG_DIRTY_PAGES);
> +
>         vcpu_set_hv_cpuid(vm, VCPU_ID);
>
>         tsc_page_gva = vm_vaddr_alloc_page(vm);
>
> Triggers a NULL pointer deref.
>
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 12959a067 P4D 12959a067 PUD 129594067 PMD 0
> Oops: 0000 [#1] SMP
> CPU: 5 PID: 1784 Comm: hyperv_clock Not tainted 5.16.0-rc4+ #82
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:kvm_dirty_ring_get+0xe/0x30 [kvm]
> mark_page_dirty_in_slot.part.0+0x30/0x60 [kvm]
> kvm_write_guest+0x117/0x130 [kvm]
> kvm_hv_invalidate_tsc_page+0x99/0xf0 [kvm]
> kvm_arch_vm_ioctl+0x20f/0xb60 [kvm]
> kvm_vm_ioctl+0x711/0xd50 [kvm]
> __x64_sys_ioctl+0x7f/0xb0
> do_syscall_64+0x38/0xc0
> entry_SYSCALL_64_after_hwframe+0x44/0xae

I saw this WARNING when running the vanilla hyperv_clock test against kvm/queue,
i.e. without the modifications above.
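
If I read this correctly, both splats come from the same vCPU-less path:
kvm_hv_invalidate_tsc_page() runs from a VM-scoped ioctl (KVM_SET_CLOCK, per the
trace), where kvm_get_running_vcpu() returns NULL, so the new check in
mark_page_dirty_in_slot() fires; without that check, the dirty-ring case goes on
to dereference the NULL vcpu in kvm_dirty_ring_get(), which looks like the second
splat above. From memory, the check added by commit 03c0304a86bc is roughly the
following (not necessarily the exact upstream hunk):

	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

#ifdef CONFIG_HAVE_KVM_DIRTY_RING
	/*
	 * Warn and bail when there is no running vCPU, e.g. for guest
	 * writes issued from a VM-scoped ioctl such as KVM_SET_CLOCK,
	 * instead of pushing to a non-existent per-vCPU dirty ring.
	 */
	if (WARN_ON_ONCE(!vcpu) || WARN_ON_ONCE(vcpu->kvm != kvm))
		return;
#endif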
>
> Commit e880c6ea55b9 ("KVM: x86: hyper-v: Prevent using not-yet-updated TSC page
> by secondary CPUs") is squarely to blame as it was added after dirty ring, though
> in Vitaly's defense, David put it best: "That's a fairly awful bear trap".
>
> Assuming there's nothing special about vcpu0's clock, I believe the below fixes
> all issues. If not, then the correct fix is to have vcpu0 block all other vCPUs
> until the update completes.

Could we instead invalidate hv->tsc_ref.tsc_sequence directly, without marking the
page dirty? After migration it is stale anyway until the first successful
kvm_hv_setup_tsc_page() call.
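
For illustration only, one way to read that idea, completely untested and assuming
the current shape of kvm_hv_invalidate_tsc_page() on kvm/queue: keep the existing
locking/status handling, but drop the kvm_write_guest() of the zeroed sequence and
only reset KVM's cached copy, so this vCPU-less path never reaches
mark_page_dirty():

	/*
	 * Untested sketch: invalidate only KVM's cached copy of the
	 * sequence instead of writing it out to the guest page here.
	 * The guest-visible page is rewritten, from vCPU context, by
	 * the next successful kvm_hv_setup_tsc_page() call anyway.
	 */
	hv->tsc_ref.tsc_sequence = 0;

    Wanpeng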