Re: [PATCH] KVM: arm64: Correctly handle the mmio faulting

Santosh Shukla <sashukla@xxxxxxxxxx> · Mon, 26 Oct 2020 10:26:41 +0530

    Hi Marc,

Thanks for the review comment.

    On 10/23/2020 4:59 PM, Marc Zyngier wrote:

      Hi Santosh,

      Thanks for this.

      On 2020-10-21 17:16, Santosh Shukla wrote:

        Commit 6d674e28 introduced a way to detect and handle device
        mappings. It checks whether VM_PFNMAP is set in vma->vm_flags
        and, if so, sets force_pte to true so that the THP adjustment
        (transparent_hugepage_adjust()) is skipped.

        There can be a problem with the order in which VM_PFNMAP is set
        and checked. For example, consider an mdev vendor driver that
        registers a vma fault handler named vma_mmio_fault(), which
        maps the host MMIO region by calling remap_pfn_range() on the
        MMIO's vma. remap_pfn_range() implicitly sets VM_PFNMAP in
        vma->vm_flags.

        Now assume an MMIO fault handling flow where the guest first
        accesses an MMIO region whose stage-2 translation is not
        present. The arm64 KVM hypervisor then executes the guest abort
        handler, as below:

        kvm_handle_guest_abort() -->
         user_mem_abort()--> {
            ...
            0. checks vma->vm_flags for VM_PFNMAP.
            1. Since the VM_PFNMAP flag is not yet set, force_pte _is_ false;
            2. gfn_to_pfn_prot() -->
                __gfn_to_pfn_memslot() -->
                    fixup_user_fault() -->
                        handle_mm_fault()-->
                            __do_fault() -->
                               vma_mmio_fault() --> // vendor's mdev fault handler
                                remap_pfn_range()--> // Here sets the VM_PFNMAP
                                                      flag into vma->vm_flags.
            3. Since force_pte was left false in step 1,
               transparent_hugepage_adjust() is executed, which
               leads to the Oops [4].
         }

      Hmmm. Nice. Any chance you could provide us with an actual reproducer?

    I tried to recreate the scenario with the vfio-pci driver, using an
      NVIDIA GPU in PT mode. Since the vfio-pci driver now supports vma
      faulting (vfio_pci_mmap_fault()), I could build a crude reproducer
      with it.

      To create the repro, I did an ugly hack in arch/arm64/kvm/mmu.c.
      The hack makes sure that stage-2 mappings are not created at
      vm_init time, by clearing the VM_PFNMAP flag. Clearing the flag is
      needed because vfio-pci's mmap function (vfio_pci_mmap()) sets
      VM_PFNMAP for the MMIO region by default, but I want
      remap_pfn_range() to set VM_PFNMAP from vfio's fault handler,
      vfio_pci_mmap_fault().

      With the above, when the guest accesses the MMIO region, it
      triggers the MMIO fault path in the arm64 KVM hypervisor, as below:

      user_mem_abort() {->...
          --> Checks the VM_PFNMAP flag; since it is not set, force_pte=false
          ....
          __gfn_to_pfn_memslot()-->
          ...
          handle_mm_fault()-->
          do_fault()-->
          vfio_pci_mmap_fault()-->
          remap_pfn_range()--> Now sets the VM_PFNMAP flag.
      }
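
      The ordering in this flow can be modelled in plain userspace C.
      This is only a sketch under stated assumptions: struct model_vma,
      MODEL_VM_PFNMAP, model_mmap_fault() and model_user_mem_abort() are
      hypothetical stand-ins invented for illustration, not kernel code.
      The point is just that the flag is sampled before the fault
      handler gets a chance to set it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit for this model; not taken from kernel headers. */
#define MODEL_VM_PFNMAP (1UL << 10)

/* Hypothetical stand-in for the vma; only the flags matter here. */
struct model_vma {
	unsigned long vm_flags;
};

/*
 * Models a driver fault handler whose remap_pfn_range() call sets the
 * PFNMAP flag as a side effect.
 */
static void model_mmap_fault(struct model_vma *vma)
{
	assert(vma != NULL);
	vma->vm_flags |= MODEL_VM_PFNMAP;
}

/*
 * Models the ordering in the flow: the flag is sampled first and the
 * fault is resolved afterwards. Returns the force_pte value that the
 * later THP check would see.
 */
static bool model_user_mem_abort(struct model_vma *vma)
{
	bool force_pte;

	assert(vma != NULL);
	force_pte = (vma->vm_flags & MODEL_VM_PFNMAP) != 0; /* sampled early */
	model_mmap_fault(vma);                              /* flag set only now */
	return force_pte;                                   /* stale on first fault */
}
```

      In this model, the first call on a vma with no flags set returns
      false even though the flag is set by the time the call returns; a
      second call on the same vma returns true.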

      I have also set force_pte=true, just to avoid the THP Oops that
      was mentioned in the previous thread.

      Hackish change to reproduce the scenario:
    --->

      diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
      index 19aacc7d64de..9ef70dc624cf 100644
      --- a/arch/arm64/kvm/mmu.c
      +++ b/arch/arm64/kvm/mmu.c
      @@ -836,9 +836,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
              }
              if (is_error_noslot_pfn(pfn))
                      return -EFAULT;
      -
              if (kvm_is_device_pfn(pfn)) {
                      device = true;
      +               force_pte = true;
              } else if (logging_active && !write_fault) {
                      /*
                       * Only actually map the page as writable if this was a write
      @@ -1317,6 +1317,11 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
                      vm_start = max(hva, vma->vm_start);
                      vm_end = min(reg_end, vma->vm_end);
      +               /* Hack to make sure stage2 mapping not present, thus trigger
      +                * user_mem_abort for stage2 mapping */
      +               if (vma->vm_flags & VM_PFNMAP) {
      +                       vma->vm_flags = vma->vm_flags & (~VM_PFNMAP);
      +               }
                      if (vma->vm_flags & VM_PFNMAP) {
                              gpa_t gpa = mem->guest_phys_addr +
                                          (vm_start - mem->userspace_addr);

        The proposal is to check the is_iomap flag before calling the
        THP function transparent_hugepage_adjust().

        [4] THP Oops:

        pc: kvm_is_transparent_hugepage+0x18/0xb0

          ...

          ...

          user_mem_abort+0x340/0x9b8

          kvm_handle_guest_abort+0x248/0x468

          handle_exit+0x150/0x1b0

          kvm_arch_vcpu_ioctl_run+0x4d4/0x778

          kvm_vcpu_ioctl+0x3c0/0x858

          ksys_ioctl+0x84/0xb8

          __arm64_sys_ioctl+0x28/0x38

        Tested on a Huawei Kunpeng Taishan-200 arm64 server, using a
        VFIO-mdev device.

        Linux tip: 583090b1

        Fixes: 6d674e28 ("KVM: arm/arm64: Properly handle faulting of device mappings")

        Signed-off-by: Santosh Shukla <sashukla@xxxxxxxxxx>

        ---

         arch/arm64/kvm/mmu.c | 2 +-

         1 file changed, 1 insertion(+), 1 deletion(-)

        diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
        index 3d26b47..ff15357 100644
        --- a/arch/arm64/kvm/mmu.c
        +++ b/arch/arm64/kvm/mmu.c
        @@ -1947,7 +1947,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
               * If we are not forced to use page mapping, check if we are
               * backed by a THP and thus use block mapping if possible.
               */
        -     if (vma_pagesize == PAGE_SIZE && !force_pte)
        +     if (vma_pagesize == PAGE_SIZE && !force_pte && !is_iomap(flags))
                      vma_pagesize = transparent_hugepage_adjust(memslot, hva,
                                                                 &pfn, &fault_ipa);
              if (writable)

      Why don't you directly set force_pte to true at the point where we
      update the flags? It certainly would be a bit more readable:

    Yes.

      diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
      index 3d26b47a1343..7a4ad984d54e 100644
      --- a/arch/arm64/kvm/mmu.c
      +++ b/arch/arm64/kvm/mmu.c
      @@ -1920,6 +1920,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
             if (kvm_is_device_pfn(pfn)) {
                     mem_type = PAGE_S2_DEVICE;
                     flags |= KVM_S2PTE_FLAG_IS_IOMAP;
      +               force_pte = true;
             } else if (logging_active) {
                     /*
                      * Faults on pages in a memslot with logging enabled

      and almost directly applies to what we have queued for 5.10.

    Right. I believe the above code changed slightly in linux-next
      commit 9695c4ff.

    The modified version looks like this:
      diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
      index 19aacc7..d4cd253 100644
      --- a/arch/arm64/kvm/mmu.c
      +++ b/arch/arm64/kvm/mmu.c
      @@ -839,6 +839,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
              if (kvm_is_device_pfn(pfn)) {
                      device = true;
      +               force_pte = true;
              } else if (logging_active && !write_fault) {
                      /*
                       * Only actually map the page as writable if this was a write
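
    The effect of setting force_pte in the device-pfn branch can be
    sketched with a small userspace model. Everything here is an
    assumption made for illustration (struct model_vma, struct
    model_fault_result, MODEL_VM_PFNMAP, model_is_device_pfn() and
    model_user_mem_abort_fixed() are hypothetical names, not kernel
    code). It only shows that deciding after the pfn is known does not
    depend on when the vma flag was set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit for this model; not taken from kernel headers. */
#define MODEL_VM_PFNMAP (1UL << 10)

/* Hypothetical stand-in for the vma and the memory behind it. */
struct model_vma {
	unsigned long vm_flags;
	bool backed_by_device; /* true if the resolved pfn is device memory */
};

/* What the model records about one fault. */
struct model_fault_result {
	bool device;
	bool force_pte;
	bool thp_adjust_called;
};

/* Stand-in for a "is this a device pfn" check done after translation. */
static bool model_is_device_pfn(const struct model_vma *vma)
{
	return vma->backed_by_device;
}

static struct model_fault_result model_user_mem_abort_fixed(struct model_vma *vma)
{
	struct model_fault_result res = { false, false, false };

	assert(vma != NULL);

	/* Early sample of the vma flag, as in the original flow. */
	res.force_pte = (vma->vm_flags & MODEL_VM_PFNMAP) != 0;

	/* Fault resolution: a driver handler may set the flag only now. */
	if (vma->backed_by_device)
		vma->vm_flags |= MODEL_VM_PFNMAP;

	/* Modelled change: a device pfn always forces a page mapping. */
	if (model_is_device_pfn(vma)) {
		res.device = true;
		res.force_pte = true;
	}

	/* The THP adjustment is only reached when force_pte is false. */
	res.thp_adjust_called = !res.force_pte;
	return res;
}
```

    In this model, a device-backed vma that starts with no flags set
    still ends up with force_pte true and skips the THP adjustment,
    while an ordinary memory vma is unaffected.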

    Please let me know if the above is okay, and I will send out v2.
    Thanks.
    Santosh

    Thanks,

              M.

      --

      Jazz is not dead. It just smells funny...

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm