On 6/1/2022 12:45 AM, Will Deacon wrote:
On Fri, May 20, 2022 at 05:03:29PM +0100, Alexandru Elisei wrote:
On Thu, May 19, 2022 at 02:41:08PM +0100, Will Deacon wrote:
Now that EL2 is able to manage guest stage-2 page-tables, avoid
allocating a separate MMU structure in the host and instead introduce a
new fault handler which responds to guest stage-2 faults by sharing
GUP-pinned pages with the guest via a hypercall. These pages are
recovered (and unpinned) on guest teardown via the page reclaim
hypercall.
Signed-off-by: Will Deacon <will@xxxxxxxxxx>
---
[..]
+static int pkvm_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
+ unsigned long hva)
+{
+ struct kvm_hyp_memcache *hyp_memcache = &vcpu->arch.pkvm_memcache;
+ struct mm_struct *mm = current->mm;
+ unsigned int flags = FOLL_HWPOISON | FOLL_LONGTERM | FOLL_WRITE;
+ struct kvm_pinned_page *ppage;
+ struct kvm *kvm = vcpu->kvm;
+ struct page *page;
+ u64 pfn;
+ int ret;
+
+ ret = topup_hyp_memcache(hyp_memcache, kvm_mmu_cache_min_pages(kvm));
+ if (ret)
+ return -ENOMEM;
+
+ ppage = kmalloc(sizeof(*ppage), GFP_KERNEL_ACCOUNT);
+ if (!ppage)
+ return -ENOMEM;
+
+ ret = account_locked_vm(mm, 1, true);
+ if (ret)
+ goto free_ppage;
+
+ mmap_read_lock(mm);
+ ret = pin_user_pages(hva, 1, flags, &page, NULL);
When I implemented memory pinning via GUP for the KVM SPE series, I
discovered that the pages were regularly unmapped at stage 2 because of
automatic NUMA balancing, as change_prot_numa() ends up calling
mmu_notifier_invalidate_range_start().
I was curious how you managed to avoid that. I don't know my way around
pKVM and can't seem to find where that's implemented.
With this series, we don't take any notice of the MMU notifiers at EL2
so the stage-2 remains intact. The GUP pin will prevent the page from
being migrated as the rmap walker won't be able to drop the mapcount.
It's functional, but we'd definitely like to do better in the long term.
The fd-based approach that I mentioned in the cover letter gets us some of
the way there for protected guests ("private memory"), but non-protected
guests running under pKVM are proving to be pretty challenging (we need to
deal with things like sharing the zero page...).
Will
My understanding is that with pin_user_pages(), the pages used by guests
(both protected and non-protected) stay pinned for a long time and will
not be swapped out or migrated, so there is no need to care about the MMU
notifiers. Is that right?
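
To make the question concrete, here is a toy userspace model of what I
think is going on. It is not kernel code: every toy_* name is made up for
illustration, and the real migration path checks far more than this. The
only detail borrowed from the kernel is the idea that a pin adds a large
bias (1024, like GUP_PIN_COUNTING_BIAS) to the page reference count, so a
pinned page always carries references beyond its mappings.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant, mirroring the value of GUP_PIN_COUNTING_BIAS. */
#define TOY_PIN_BIAS 1024

/* Hypothetical stand-in for struct page; only the two counters matter. */
struct toy_page {
	int refcount;	/* all references, including pins */
	int mapcount;	/* page-table mappings, one reference each */
};

/* Model of pinning: add the bias to the reference count. */
static void toy_pin(struct toy_page *p)
{
	p->refcount += TOY_PIN_BIAS;
}

/* Model of unpinning: drop the bias again. */
static void toy_unpin(struct toy_page *p)
{
	assert(p->refcount >= TOY_PIN_BIAS);
	p->refcount -= TOY_PIN_BIAS;
}

/* True if the count is large enough that the page may be pinned. */
static bool toy_maybe_pinned(const struct toy_page *p)
{
	return p->refcount >= TOY_PIN_BIAS;
}

/*
 * Simplifying assumption: migration is only allowed when the mappings
 * hold the only references. Any extra reference, such as a pin, blocks it.
 */
static bool toy_can_migrate(const struct toy_page *p)
{
	return p->refcount == p->mapcount;
}
```

In this model a page mapped once is migratable, stops being migratable
after toy_pin(), and becomes migratable again after toy_unpin(), which
would correspond to the page reclaim hypercall on guest teardown.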