Re: [PATCH RFC v8 48/56] KVM: SVM: Add SNP-specific handling for memory attribute updates

Michael Roth <michael.roth@xxxxxxx> · Wed, 5 Apr 2023 18:48:26 -0500

On Wed, Mar 01, 2023 at 03:37:00PM -0800, Dave Hansen wrote:
> On 2/20/23 10:38, Michael Roth wrote:
> > +		/*
> > +		 * TODO: The RMP entry's hugepage bit is ignored for
> > +		 * shared/unassigned pages. Either handle looping through each
> > +		 * sub-page as part of snp_make_page_shared(), or remove the
> > +		 * level argument.
> > +		 */
> > +		if (op == SNP_PAGE_STATE_PRIVATE && order &&
> > +		    IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) {
> > +			level = order_to_level(order);
> > +			npages = 1 << order;
> > +		}
> 
> That's a wee bit obtuse.
> 
> First of all, I assume that the 'RFC' is because of these TODOs and they
> won't survive to the point when you ask for this to be merged.

Yes, will make sure to have all the TODOs in the tree addressed before
dropping the RFC tag.

> 
> BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned?

I don't think anything enforces that currently, but there is a TODO in the
UPM v10 patchset to enforce that:

  https://github.com/AMDESE/linux/commit/5c86db7f98701f614c48946b733f2542c962f139#diff-e7514a224c92c2e47224f99919405a37ee7edc4612953135229cfb6e07a680d8R2131

So currently this patch relies on the following:

 - the fact that the memslot alignment/sizes for a standard x86 guest's
   associated memory regions are 2M-aligned, so when they are bound to a
   restrictedmem FD they are naturally packed in at restrictedmem offsets
   that are also 2M-aligned. But of course we can't assume userspace will
   live up to this assumption and need the above TODO in KVM to enforce
   this when registering new memslots.

 - that restrictedmem/shmem will ensure that THPs are only allocated for
   restrictedmem offsets that are 2M-aligned. I think this enforcement
   happens in shmem_alloc_hugefolio().

which both seem to hold in testing. But it's probably a good idea to add
an explicit check for this, at least until KVM implements something to
enforce this earlier in guest life-cycle.

> 
> Let's start with this:
> 
> > +static inline u8 order_to_level(int order)
> > +{
> > +	BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G);
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G))
> > +		return PG_LEVEL_1G;
> > +
> > +	if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M))
> > +		return PG_LEVEL_2M;
> > +
> > +	return PG_LEVEL_4K;
> > +}
> 
> Right now, 'order' comes only from restrictedmem_get_page(), which I dug
> out of:
> 
> > https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8
> 
> That order is *only* filled in by THPs.  That makes the PG_LEVEL_1G
> stuff here kinda silly.  I guess it might be seen as thorough, but it's
> dead code.  I'd probably just make this work on order==9 || order==0 and
> warn on anything else.

Ok, makes sense.

> 
> I'd also highly recommend some comments about how racy this all is.  I
> guess it probably works, but it would be good to add some comments about
> page splits and collapsing.

Collapsing while in this code path should be ok since the 4K sub-pages will
just end up getting mapped as 4K in RMP table. KVM MMU will then map
them into nested page table as 4K as well and we'll get non-optimal
performance, but things should otherwise work.

Splitting is a similar story: if we map as 2M in RMP table, and then
afterward the page gets split, then KVM MMU during fault time would map
the pages in the NPT as 4K, and when the guest attempts to access
private pages of this sort they'll generate a nested page fault with
PFERR_GUEST_RMP_BIT and PFERR_GUEST_SIZEM_BIT set, and the code in
handle_rmp_page_fault() will issue a PSMASH instruction to split the
2M RMP entry into 512 4K RMP entries.

Will add some comments around this.

> 
> It's also not obvious why this only cares about private pages.

Mainly because the shared memory that actually get mapped into the guest
is always shared in the RMP table. It is normal VMA memory that is not
allocated by UPM/restrictedmem. We will never attempt to make them
private, so there is never a need to bother with switching them back to
shared.

So we only need to handle RMP updates for the UPM/restrictedmem PFNs.
Obviously for shared->private conversion before mapping it into the
guest, but also for private->shared conversion since we will still
get RMP check failures if we try to leave the PFNs as private in the
RMP table and map the above-mentioned VMA memory into the guest
instead.

Will add some more comments around this.

> 
> Anyway, this is the exact kind of thing where I really like a
> well-commented helper:
> 
> bool can_install_large_rmp_entry(gfn, order)
> {
> 	// small pages, blah blah
> 	if (!order)
> 		return false;
> 
> 	// The region being updated must be aligned
> 	if (!IS_ALIGNED(gfn, 1 << order))
> 		return false;
> 	// ... and fit
> 	if (gfn + (1 << order)) > end)
> 		return false;
> 
> 	return true;	
> }
> 
> Which gets used like this:
> 
> 	if (op == SNP_PAGE_STATE_PRIVATE &&
> 	    can_install_large_rmp_entry(gfn, order)) {
> 		level = ...
> 	}

Makes sense, will implement something along these lines.

Thanks!

-Mike