Re: [PATCH] arm64/kvm: Fix zapping stage2 page table wrongly

Hi Gavin,

On 2020-08-23 00:59, Gavin Shan wrote:
Hi Marc,

On 8/22/20 8:01 PM, Marc Zyngier wrote:
On Sat, 22 Aug 2020 03:44:44 +0100,
Gavin Shan <gshan@xxxxxxxxxx> wrote:

Depending on the kernel configuration, PUD_SIZE could be equal to
PMD_SIZE. For example, both of them are 512MB with the following
kernel configuration. In this case, both PUD and PMD are folded
to PGD.

    CONFIG_ARM64_64K_PAGES   y
    CONFIG_ARM64_VA_BITS     42
    CONFIG_PGTABLE_LEVELS    2

With the above configuration, a stage2 PUD entry is used to back the
512MB huge page when the stage2 mapping is built. During the mapping,
the PUD and its subordinate levels of page table entries are unmapped
in stage2_set_pud_huge() if the PUD is present and not a huge page.
Unfortunately, @addr isn't aligned to S2_PUD_SIZE, so the wrong page
table entries are zapped. As a result, the PUD's present bit can never
be cleared, leading to an infinite loop in stage2_set_pud_huge().

Fix the issue by checking against S2_{PUD, PMD}_SIZE instead of
{PUD, PMD}_SIZE to determine whether a stage2 PUD or PMD is used to
back the huge page. For this particular case, a stage2 PMD entry
should be used to back the 512MB huge page via stage2_set_pmd_huge().

It isn't obvious to me how S2_PMD_SIZE can be different from PMD_SIZE,
and the current code certainly expects both to be equal (just look at
how often S2_*_SIZE is used in the current code to convince yourself).

My guess is that some lesser tested configurations (such as 64k pages)
break that assumption, and result in the wrong thing happening. Could
you please shed some light on it?


With the following kernel configuration, PUD_SIZE and PMD_SIZE are equal
and both of them are 512MB because P4D/PUD/PMD are folded into PGD.

   CONFIG_ARM64_64K_PAGES   y
   CONFIG_ARM64_VA_BITS     42
   CONFIG_PGTABLE_LEVELS    2
   PMD_SIZE                 512MB   (include/asm-generic/pgtable-nopmd.h)
   PUD_SIZE                 512MB   (include/asm-generic/pgtable-nopud.h)
   P4D_SIZE                 512MB   (include/asm-generic/pgtable-nop4d.h)
   S2_PMD_SIZE              512MB               (stage2_pgtable.h)
   S2_PUD_SIZE                4TB               (stage2_pgtable.h)

For a start, this last value makes no sense at all. If we only
have 2 levels, S2_PUD_SIZE is wrong, and should either be ignored or
be the same as S2_PMD_SIZE (4TB represents the whole of the guest's
IPA space).

For this particular case, S2_PMD_SIZE and PMD_SIZE are equal and both
of them are 512MB. However, the issue is that the wrong size
(PMD_SIZE/PUD_SIZE) is checked when deciding between
stage2_set_pud_huge() and stage2_set_pmd_huge() in user_mem_abort().
This causes stage2_set_pud_huge() to be called to map the 512MB huge
page, meaning a stage2 PUD is used to back the 512MB huge page.

Which should be fine, because that's all we can map. 2 levels, remember?
The real bug is that we consider that mapping a PUD is valid, while there
is no real PUD in this context.


In stage2_set_pud_huge(), the S2 page table entries are zapped for the
range ((addr & S2_PUD_MASK), S2_PUD_SIZE), whose size is 4TB. However,
we're mapping a 512MB huge page (block), meaning unrelated page table
entries are cleared.

We can't map 4TB. That's not possible for such configuration (and I
doubt you have more than 4TB of contiguous memory on your system).


static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
                          struct kvm_memory_slot *memslot, unsigned long hva,
                          unsigned long fault_status)
{
    :
    if (vma_pagesize == PUD_SIZE) {           /* PUD_SIZE == 512MB */
        ret = stage2_set_pud_huge(mmu, memcache, fault_ipa, &new_pud);
    } else if (vma_pagesize == PMD_SIZE) {    /* PMD_SIZE == 512MB */
        ret = stage2_set_pmd_huge(mmu, memcache, fault_ipa, &new_pmd);
    } else {
        ret = stage2_set_pte(mmu, memcache, fault_ipa, &new_pte, flags);
    }
    :
}

The issue was initially reported by Eric and can be reproduced on an
upstream kernel/qemu with 64KB page size and 42-bit VA
(CONFIG_ARM64_VA_BITS=42) enabled. Here is the command I used to
reproduce the issue. Note that the IPA limit reported on the machine
where I reproduced the issue is 44 bits, but qemu only uses 40 bits,
meaning the stage2 page table has 3 levels.

No. KVM, in its current form, cannot have more levels at stage-2 than it
has at stage-1 (that's one of the things we are fixing with Will's patches).


start_vm_aarch64_hugetlbfs() {
   echo 32 > /sys/kernel/mm/hugepages/hugepages-524288kB/nr_hugepages

Ah, hugetlb. I wonder if we have something funky here. Alexandru (cc'd)
was looking at something in this area (see [1]).


   /home/gavin/sandbox/qemu.main/aarch64-softmmu/qemu-system-aarch64 \
      --enable-kvm -machine virt,gic-version=host \
      -cpu host -smp 8,sockets=8,cores=1,threads=1 \
      -m 8G -mem-prealloc -mem-path /dev/hugepages \
      -monitor none -serial mon:stdio -nographic -s \
      -bios /home/gavin/sandbox/qemu.main/pc-bios/edk2-aarch64-code.fd \
      -kernel /home/gavin/sandbox/linux.guest/arch/arm64/boot/Image \
      -initrd /home/gavin/sandbox/images/rootfs.cpio.xz \
      -append "earlycon=pl011,mmio,0x9000000" \
      -device virtio-net-pci,netdev=unet,mac=52:54:00:f1:26:a6 \
      -netdev user,id=unet,hostfwd=tcp::50959-:22 \
      -drive file=/home/gavin/sandbox/images/vm.img,if=none,format=raw,id=nvme0 \
      -device nvme,drive=nvme0,serial=foo \
      -drive file=/home/gavin/sandbox/images/vm1.img,if=none,format=raw,id=nvme1 \
      -device nvme,drive=nvme1,serial=foo1
}

I'll try to reproduce this.

        M.

[1] https://lore.kernel.org/r/20200901133357.52640-1-alexandru.elisei@xxxxxxx
--
Jazz is not dead. It just smells funny...
_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm


