On Wed, Mar 01, 2023 at 03:37:00PM -0800, Dave Hansen wrote: > On 2/20/23 10:38, Michael Roth wrote: > > + /* > > + * TODO: The RMP entry's hugepage bit is ignored for > > + * shared/unassigned pages. Either handle looping through each > > + * sub-page as part of snp_make_page_shared(), or remove the > > + * level argument. > > + */ > > + if (op == SNP_PAGE_STATE_PRIVATE && order && > > + IS_ALIGNED(gfn, 1 << order) && (gfn + (1 << order)) <= end) { > > + level = order_to_level(order); > > + npages = 1 << order; > > + } > > That's a wee bit obtuse. > > First of all, I assume that the 'RFC' is because of these TODOs and they > won't survive to the point when you ask for this to be merged. Yes, will make sure to have all the TODOs in the tree addressed before dropping the RFC tag. > > BTW, what keeps the restrictedmem_get_page() offset and the gfn aligned? I don't think anything enforces that currently, but there is a TODO in the UPM v10 patchset to enforce that: https://github.com/AMDESE/linux/commit/5c86db7f98701f614c48946b733f2542c962f139#diff-e7514a224c92c2e47224f99919405a37ee7edc4612953135229cfb6e07a680d8R2131 So currently this patch relies on the following: - the fact that the memslot alignment/sizes for a standard x86 guest's associated memory regions are 2M-aligned, so when they are bound to a restrictedmem FD they are naturally packed in at restrictedmem offsets that are also 2M-aligned. But of course we can't assume userspace will live up to this assumption and need the above TODO in KVM to enforce this when registering new memslots. - that restrictedmem/shmem will ensure that THPs are only allocated for restrictedmem offsets that are 2M-aligned. I think this enforcement happens in shmem_alloc_hugefolio(). which both seem to hold in testing. But it's probably a good idea to add an explicit check for this, at least until KVM implements something to enforce this earlier in guest life-cycle. > > Let's start with this: > > > +static inline u8 order_to_level(int order) > > +{ > > + BUILD_BUG_ON(KVM_MAX_HUGEPAGE_LEVEL > PG_LEVEL_1G); > > + > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_1G)) > > + return PG_LEVEL_1G; > > + > > + if (order >= KVM_HPAGE_GFN_SHIFT(PG_LEVEL_2M)) > > + return PG_LEVEL_2M; > > + > > + return PG_LEVEL_4K; > > +} > > Right now, 'order' comes only from restrictedmem_get_page(), which I dug > out of: > > > https://github.com/mdroth/linux/commit/c6792672cd11737fd255dff10b2d5b6bccc626a8 > > That order is *only* filled in by THPs. That makes the PG_LEVEL_1G > stuff here kinda silly. I guess it might be seen as thorough, but it's > dead code. I'd probably just make this work on order==9 || order==0 and > warn on anything else. Ok, makes sense. > > I'd also highly recommend some comments about how racy this all is. I > guess it probably works, but it would be good to add some comments about > page splits and collapsing. Collapsing while in this code path should be ok since the 4K sub-pages will just end up getting mapped as 4K in RMP table. KVM MMU will then map them into nested page table as 4K as well and we'll get non-optimal performance, but things should otherwise work. Splitting is a similar story: if we map as 2M in RMP table, and then afterward the page gets split, then KVM MMU during fault time would map the pages in the NPT as 4K, and when the guest attempts to access private pages of this sort they'll generate a nested page fault with PFERR_GUEST_RMP_BIT and PFERR_GUEST_SIZEM_BIT set, and the code in handle_rmp_page_fault() will issue a PSMASH instruction to split the 2M RMP entry into 512 4K RMP entries. Will add some comments around this. > > It's also not obvious why this only cares about private pages. Mainly because the shared memory that actually get mapped into the guest is always shared in the RMP table. It is normal VMA memory that is not allocated by UPM/restrictedmem. We will never attempt to make them private, so there is never a need to bother with switching them back to shared. So we only need to handle RMP updates for the UPM/restrictedmem PFNs. Obviously for shared->private conversion before mapping it into the guest, but also for private->shared conversion since we will still get RMP check failures if we try to leave the PFNs as private in the RMP table and map the above-mentioned VMA memory into the guest instead. Will add some more comments around this. > > Anyway, this is the exact kind of thing where I really like a > well-commented helper: > > bool can_install_large_rmp_entry(gfn, order) > { > // small pages, blah blah > if (!order) > return false; > > // The region being updated must be aligned > if (!IS_ALIGNED(gfn, 1 << order)) > return false; > // ... and fit > if (gfn + (1 << order)) > end) > return false; > > return true; > } > > Which gets used like this: > > if (op == SNP_PAGE_STATE_PRIVATE && > can_install_large_rmp_entry(gfn, order)) { > level = ... > } Makes sense, will implement something along these lines. Thanks! -Mike