Re: [PATCH 33/89] KVM: arm64: Handle guest stage-2 page-tables entirely at EL2

"Huang, Shaoqin" <shaoqin.huang@xxxxxxxxx> · Thu, 28 Jul 2022 14:50:12 +0800

On 7/27/2022 5:59 PM, Alexandru Elisei wrote:
Hi,

On Wed, Jun 08, 2022 at 09:16:56AM +0800, Huang, Shaoqin wrote:

On 6/1/2022 12:45 AM, Will Deacon wrote:
On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
Now that EL2 is able to manage guest stage-2 page-tables, avoid
allocating a separate MMU structure in the host and instead introduce a
new fault handler which responds to guest stage-2 faults by sharing
GUP-pinned pages with the guest via a hypercall. These pages are
recovered (and unpinned) on guest teardown via the page reclaim
hypercall.

Signed-off-by: Will Deacon <will@xxxxxxxxxx>
---
[..]
+static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+			  unsigned long hva)
+{
+	struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
+	struct mm_struct *mm = current->mm;
+	unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
+	struct kvm_pinned_page *ppage;
+	struct kvm *kvm = vcpu->kvm;
+	struct page *page;
+	u64 pfn;
+	int ret;
+
+	ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
+	if (ret)
+		return -ENOMEM;
+
+	ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
+	if (!ppage)
+		return -ENOMEM;
+
+	ret = account_locked_vm(mm, 1, true);
+	if (ret)
+		goto free_ppage;
+
+	mmap_read_lock(mm);
+	ret = pin_user_pages(hva, 1, flags, &page, NULL);

When I implemented memory pinning via GUP for the KVM SPE series, I
discovered that the pages were regularly unmapped at stage 2 because of
automatic numa balancing, as change_prot_numa() ends up calling
mmu_notifier_invalidate_range_start().

I was curious how you managed to avoid that, I don't know my way around
pKVM and can't seem to find where that's implemented.

With this series, we don't take any notice of the MMU notifiers at EL2
so the stage-2 remains intact. The GUP pin will prevent the page from
being migrated as the rmap walker won't be able to drop the mapcount.

It's functional, but we'd definitely like to do better in the long term.
The fd-based approach that I mentioned in the cover letter gets us some of
the way there for protected guests ("private memory"), but non-protected
guests running under pKVM are proving to be pretty challenging (we need to
deal with things like sharing the zero page...).

Will

My understanding is that with the pin_user_pages, the page that used by
guests (both protected and non-protected) will stay for a long time, and the
page will not be swapped or migrated. So no need to care about the MMU
notifiers. Is it right?

There are two things here.

First, pinning a page means making the data persistent in memory. From
Documentation/core-api/pin_user_pages.rst:

"FOLL_PIN is a *replacement* for FOLL_GET, and is for short term pins on
pages whose data *will* get accessed. As such, FOLL_PIN is a "more severe"
form of pinning. And finally, FOLL_LONGTERM is an even more restrictive
case that has FOLL_PIN as a prerequisite: this is for pages that will be
pinned longterm, and whose data will be accessed."

It does not mean that the translation table entry for the page is not
modified for as long as the pin exists. In the example I gave, automatic
NUMA balancing changes the protection of translation table entries to
PAGE_NONE, which will invoke the MMU notifers to unmap the corresponding
stage 2 entries, regardless of the fact that the pinned pages will not get
migrated the next time they are accessed.

There are other mechanisms in the kernel that do that, for example
split_huge_pmd(), which must always succeed, even if the THP is pinned (it
transfers the refcounts among the pages): "Note that split_huge_pmd()
doesn't have any limitations on refcounting: pmd can be split at any point
and never fails" (Documentation/vm/transhuge.rst, also see
__split_huge_pmd() from mm/huge_memory.c).

KSM also does that: it invokes the invalidate_range_start MMU notifier
before backing out of the merge because of the refcount (see mm/ksm.c::
try_to_merge_one_page -> write_protect_page).

This brings me to my second point: one might rightfully ask themselves (I
did!), why not invoke the MMU notifiers *after* checking that the page is
not pinned? It turns out that that is not reliable, because the refcount is
increased by GUP with the page lock held (which is a spinlock), but by
their design the invalidate_range_start MMU notifiers must be called from
interruptible + preemptible context. The only way to avoid races would be
to call the MMU notifier while holding the page table lock, which is
impossible.

Hope my explanation has been adequate.

Thanks,
Alex

Thanks for your clear explanation.