Re: [PATCH] arm64/kvm: Fix zapping stage2 page table wrongly

Gavin Shan <gshan@xxxxxxxxxx> · Sun, 23 Aug 2020 09:59:26 +1000

Hi Marc,

On 8/22/20 8:01 PM, Marc Zyngier wrote:
On Sat, 22 Aug 2020 03:44:44 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:

Depending on the kernel configuration, PUD_SIZE could be equal to
PMD_SIZE. For example, both of them are 512MB with the following
kernel configuration. In this case, both PUD and PMD are folded
to PGD.

    CONFIG_ARM64_64K_PAGES   y
    CONFIG_ARM64_VA_BITS     42
    CONFIG_PGTABLE_LEVELS    2

With the above configuration, the stage2 PUD is used to backup the
512MB huge page when the stage2 mapping is built. During the mapping,
the PUD and its subordinate levels of page table entries are unmapped
if the PUD is present and not huge page sensitive in stage2_set_pud_huge().
Unfornately, the @addr isn't aligned to S2_PUD_SIZE and wrong page table
entries are zapped. It eventually leads to PUD's present bit can't be
cleared successfully and infinite loop in stage2_set_pud_huge().

This fixes the issue by checking with S2_{PUD, PMD}_SIZE instead of
{PUD, PMD}_SIZE to determine if stage2 PUD or PMD is used to back the
huge page. For this particular case, the stage2 PMD entry should be
used to backup the 512MB huge page with stage2_set_pmd_huge().

It isn't obvious to me how S2_PMD_SIZE can be different from PMD_SIZE,
and the current code certainly expects both to be equal (just look at
how often S2_*_SIZE is used in the current code to convince yourself).

My guess is that some lesser tested configurations (such as 64k pages)
break that assumption, and result in the wrong thing happening. Could
you please shed some light on it?


With the following kernel configuration, PUD_SIZE and PMD_SIZE are equal
and both of them are 512MB because P4D/PUD/PMD are folded into PGD.

   CONFIG_ARM64_64K_PAGES   y
   CONFIG_ARM64_VA_BITS     42
   CONFIG_PGTABLE_LEVELS    2
   PMD_SIZE                 512MB               (include/asm-generic/pgtable-no-pud.h)
   PUD_SIZE                 512MB               (include/asm-generic/pgtable-no-pmd.h)
   P4D_SIZE                 512MB               (include/asm-generic/pgtable-nop4d.h)
   S2_PMD_SIZE              512MB               (stage2_pgtable.h)
   S2_PUD_SIZE                4TB               (stage2_pgtable.h)

For this particular case, S2_PMD_SIZE and PMD_SIZE are equal and both
of them are 512MB. However, the issue is wrong size (PMD_SIZE/PUD_SIZE)
is checked to call stage2_set_pud_huge() or stage2_set_pmd_huge() in
user_mem_abort(). It causes stage2_set_pud_huge() is called to map the
512MB huge page, meaning stage2 PUD is used to back the 512MB huge page.

In stage2_set_pud_huge(), the S2 page table entries are zapped for the
range ((addr & S2_PUD_MASK), S2_PUD_SIZE), whose size is 4TB. However,
we're mapping 512MB huge page (block). It means unrelated page table
entries are cleared.

static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                          struct kvm_memory_slot *memslot, unsigned long hva,
                          unsigned long fault_status)
{
    :
    if (vma_pagesize == PUD_SIZE) {           /* PUD_SIZE == 512MB */
        ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
    } else if (vma_pagesize == PMD_SIZE) {    /* PMD_SIZE == 512MB */
        ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
    } else {
        ret = stage2_set_pte(mmu, memcache, fault_ipa, &new_pte, flags);
    }
    :
}

The issue was initially reported by Eric and it can be reproduced on
upstream kernel/qemu with the configurations to enable 64KB page size
and 42-bits VA bits (CONFIG_ARM64_VA_BITS). Here is the command I used
to reproduce the issue. Note that the IPA limit reported from the machine
where I reproduced the issue is 44-bits, but qemu just uses 40-bits. It
means the stage2 pagetable has 3 levels.

start_vm_aarch64_hugetlbfs() {
   echo 32 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages

   /home/gavin/sandbox/qemu.main/aarch64-softmmu/qemu-system-aarch64           \
   --enable-kvm -machine virt,gic-version=host                                 \
   -cpu host -smp 8,sockets=8,cores=1,threads=1                                \
   -m 8G -mem-prealloc -mem-path /dev/hugepages                                \
   -monitor none -serial mon:stdio -nographic -s                               \
   -bios /home/gavin/sandbox/qemu.main/pc-bios/edk2-aarch64-code.fd            \
   -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image               \
   -initrd /home/gavin/sandbox/images/rootfs.cpio.xz                           \
   -append "earlycon=pl011,mmio,0x9000000"                                     \
   -device virtio-net-pci,netdev=unet,mac=52:54:00:f1:26:a6                    \
   -netdev user,id=unet,hostfwd=tcp::50959-:22                                 \
   -drive file=/home/gavin/sandbox/images/vm.img,if=none,format=raw,id=nvme0   \
   -device nvme,drive=nvme0,serial=foo                                         \
   -drive file=/home/gavin/sandbox/images/vm1.img,if=none,format=raw,id=nvme1  \
   -device nvme,drive=nvme1,serial=foo1
}

Fixes: 3c3736cd32bf ("KVM: arm/arm64: Fix handling of stage2 huge mappings")

This commit doesn't seem to match the code your changing (it doesn't
even come near user_mem_abort()). Are there any intermediate commits
that would better explain the problem?


When stage2_set_pud_huge() is called to map 512MB huge page, we run into
infinite loop to retry unmapping the memory range (S2_PUD_SIZE). With this
reverted, we didn't reproduce the issue. The commit is identified by "git bisect".

Cc: stable@xxxxxxxxxxxxxxx # v5.1+
Reported-by: Eric Auger <eric.auger@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
Tested-by: Eric Auger <eric.auger@xxxxxxxxxx>
Reviewed-by: Eric Auger <eric.auger@xxxxxxxxxx>
---
  arch/arm64/kvm/mmu.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 0121ef2c7c8d..deb1819ba9e2 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1964,7 +1964,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  		(fault_status == FSC_PERM &&
  		 stage2_is_exec(mmu, fault_ipa, vma_pagesize));
  
-	if (vma_pagesize == PUD_SIZE) {
+	if (vma_pagesize == S2_PUD_SIZE) {
  		pud_t new_pud = kvm_pfn_pud(pfn, mem_type);
  
  		new_pud = kvm_pud_mkhuge(new_pud);
@@ -1975,7 +1975,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
  			new_pud = kvm_s2pud_mkexec(new_pud);
  
  		ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
-	} else if (vma_pagesize == PMD_SIZE) {
+	} else if (vma_pagesize == S2_PMD_SIZE) {
  		pmd_t new_pmd = kvm_pfn_pmd(pfn, mem_type);
  
  		new_pmd = kvm_pmd_mkhuge(new_pmd);
--
2.23.0



Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm