Re: [PATCH v3 02/21] KVM: arm64: Add stand-alone page-table walker infrastructure

Gavin Shan <gshan@xxxxxxxxxx> · Thu, 3 Sep 2020 11:11:37 +1000

Hi Will,

On 9/2/20 9:02 PM, Will Deacon wrote:
On Wed, Sep 02, 2020 at 04:31:32PM +1000, Gavin Shan wrote:
On 8/25/20 7:39 PM, Will Deacon wrote:
The KVM page-table code is intricately tied into the kernel page-table
code and re-uses the pte/pmd/pud/p4d/pgd macros directly in an attempt
to reduce code duplication. Unfortunately, the reality is that there is
an awful lot of code required to make this work, and at the end of the
day you're limited to creating page-tables with the same configuration
as the host kernel. Furthermore, lifting the page-table code to run
directly at EL2 on a non-VHE system (as we plan to to do in future
patches) is practically impossible due to the number of dependencies it
has on the core kernel.

Introduce a framework for walking Armv8 page-tables configured
independently from the host kernel.

Cc: Marc Zyngier <maz@xxxxxxxxxx>
Cc: Quentin Perret <qperret@xxxxxxxxxx>
Signed-off-by: Will Deacon <will@xxxxxxxxxx>
---
   arch/arm64/include/asm/kvm_pgtable.h | 101 ++++++++++
   arch/arm64/kvm/hyp/Makefile          |   2 +-
   arch/arm64/kvm/hyp/pgtable.c         | 290 +++++++++++++++++++++++++++
   3 files changed, 392 insertions(+), 1 deletion(-)
   create mode 100644 arch/arm64/include/asm/kvm_pgtable.h
   create mode 100644 arch/arm64/kvm/hyp/pgtable.c

[...]

+struct kvm_pgtable_walk_data {
+	struct kvm_pgtable		*pgt;
+	struct kvm_pgtable_walker	*walker;
+
+	u64				addr;
+	u64				end;
+};
+

Some of the following function might be worthy to be inlined, considering
their complexity :)

I'll leave that for the compiler to figure out :)

Ok :)

+static u32 kvm_pgd_pages(u32 ia_bits, u32 start_level)
+{
+	struct kvm_pgtable pgt = {
+		.ia_bits	= ia_bits,
+		.start_level	= start_level,
+	};
+
+	return __kvm_pgd_page_idx(&pgt, -1ULL) + 1;
+}
+

It seems @pgt.start_level is assigned with wrong value here.
For example, @start_level is 2 when @ia_bits and PAGE_SIZE
are 40 and 64KB separately. In this case, __kvm_pgd_page_idx()
always return zero. However, the extra page covers up the
issue. I think something like below might be needed:

	struct kvm_pgtable pgt = {
		.ia_bits	= ia_bits,
		.start_level	= KVM_PGTABLE_MAX_LEVELS - start_level + 1,
	};

Hmm, we're pulling the start_level right out of the vtcr, so I don't see
how it can be wrong. In your example, a start_level of 2 seems correct to
me, as we'll translate 13 bits there, then 13 bits at level 3 which covers
the 24 bits you need (with a 16-bit offset within the page).

Your suggestion would give us a start_level of 1, which has a redundant
level of translation. Maybe you're looking at the levels upside-down? The
top level is level 0 and each time you walk to a new level, that number
increases.

But perhaps I'm missing something. Please could you elaborate if you think
there's a problem here?

Thanks for the explanation. I think I was understanding the code in wrong
way. In this particular path, __kvm_pgd_page_idx() is used to calculate
how many subordinate pages needed to hold PGDs. If I'm correct, there are
16 pages for PGDs to the maximal degree. So current implementation looks
correct to me.

There is another question, which might not relevant. I added some logs
around and hopefully my calculation is making sense. I have following
configuration (values) in my experiment. I'm including the kernel log
to make information complete:

   [ 5089.107147] kvm_arch_init_vm: kvm@0xfffffe0028460000, type=0x0
   [ 5089.112973] kvm_arm_setup_stage2: kvm@0xfffffe0028460000, type=0x0
   [ 5089.119157]    kvm_ipa_limit=0x2c, phys_shift=0x28
   [ 5089.123936]    kvm->arch.vtcr=0x00000000802c7558
   [ 5089.128552] kvm_init_stage2_mmu: kvm@0xfffffe0028460000
   [ 5089.133765] kvm_pgtable_stage2_init: kvm@0xfffffe0028460000, ia_bits=0x28,start_level=0x2

   PAGE_SIZE:       64KB
   @kvm->arch.vtcr: 0x00000000_802c7558
   @ipa_bits:       40
   @start_level:    2

   #define KVM_PGTABLE_MAX_LEVELS            4U

   static u64 kvm_granule_shift(u32 level)
   {
        return (KVM_PGTABLE_MAX_LEVELS - level) * (PAGE_SHIFT - 3) + 3;
   }

   static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
   {
        u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
        u64 mask = BIT(pgt->ia_bits) - 1;

        return (addr & mask) >> shift;

        // shift = kvm_granule_shift(2 - 1) = ((3 * 13) + 3) = 42
        // mask  = ((1UL << 40) - 1)
        // return (0x000000ff_ffffffff >> 42) = 0
        //
        // QUESTION: Since we have 40-bits @ipa_bits, why we need shift 42-bits here.
   }

I was also thinking about the following case, which is making sense
to me. Note I didn't add logs to debug for this case.

   PAGE_SIZE:     4KB
   @ipa_bits:     40
   @start_level:  1

   static u32 __kvm_pgd_page_idx(struct kvm_pgtable *pgt, u64 addr)
   {
        u64 shift = kvm_granule_shift(pgt->start_level - 1); /* May underflow */
        u64 mask = BIT(pgt->ia_bits) - 1;

        return (addr & mask) >> shift;

        // shift = kvm_granule_shift(1 - 1) = ((4 * 9) + 3) = 39
        // mask  = ((1UL << 40) - 1)
        // return (0x000000ff_ffffffff >> 39) = 1
   }

+static int _kvm_pgtable_walk(struct kvm_pgtable_walk_data *data)
+{
+	u32 idx;
+	int ret = 0;
+	struct kvm_pgtable *pgt = data->pgt;
+	u64 limit = BIT(pgt->ia_bits);
+
+	if (data->addr > limit || data->end > limit)
+		return -ERANGE;
+
+	if (!pgt->pgd)
+		return -EINVAL;
+
+	for (idx = kvm_pgd_page_idx(data); data->addr < data->end; ++idx) {
+		kvm_pte_t *ptep = &pgt->pgd[idx * PTRS_PER_PTE];
+
+		ret = __kvm_pgtable_walk(data, ptep, pgt->start_level);
+		if (ret)
+			break;
+	}
+
+	return ret;
+}
+

I guess we need bail on the following condition:

         if (data->addr >= limit || data->end >= limit)
             return -ERANGE;

What's wrong with the existing check? In particular, I think we _want_
to support data->end == limit (it's exclusive). If data->addr == limit,
then we'll have a size of zero and the loop won't run.

I was thinking @limit is exclusive, so we need bail when hitting the
ceiling. The @limit was figured out from @ia_bits. For example, it's
0x00000100_00000000 when @ia_bits is 40-bits, and it's invalid adress
to the guest, but I'm still wrong in this case :)

Thanks,
Gavin

_______________________________________________
kvmarm mailing list
kvmarm@xxxxxxxxxxxxxxxxxxxxx
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm