Re: [PATCH RFC 0/5] mm/gup: Introduce exclusive GUP pinning

David Hildenbrand <david@xxxxxxxxxx> · Thu, 20 Jun 2024 22:47:58 +0200

On 20.06.24 22:30, Sean Christopherson wrote:
On Thu, Jun 20, 2024, David Hildenbrand wrote:
On 20.06.24 18:36, Jason Gunthorpe wrote:
On Thu, Jun 20, 2024 at 04:45:08PM +0200, David Hildenbrand wrote:

If we could disallow pinning any shared pages, that would make life a lot
easier, but I think there were reasons for why we might require it. To
convert shared->private, simply unmap that folio (only the shared parts
could possibly be mapped) from all user page tables.

IMHO it should be reasonable to make it work like ZONE_MOVABLE and
FOLL_LONGTERM. Making a shared page private is really no different
from moving it.

And if you have built a VMM that uses VMA mapped shared pages and
short-term pinning then you should really also ensure that the VM is
aware when the pins go away. For instance if you are doing some virtio
thing with O_DIRECT pinning then the guest will know the pins are gone
when it observes virtio completions.

In this way making private is just like moving, we unmap the page and
then drive the refcount to zero, then move it.
Yes, but here is the catch: what if a single shared subpage of a large folio
is (validly) longterm pinned and you want to convert another shared subpage
to private?

Sure, we can unmap the whole large folio (including all shared parts) before
the conversion, just like we would do for migration. But we cannot detect
that nobody pinned that subpage that we want to convert to private.

Core-mm is not, and will not, track pins per subpage.

So I only see two options:

a) Disallow long-term pinning. That means, we can, with a bit of wait,
    always convert subpages shared->private after unmapping them and
    waiting for the short-term pin to go away. Not too bad, and we
    already have other mechanisms disallow long-term pinnings (especially
    writable fs ones!).

I don't think disallowing _just_ long-term GUP will suffice, if we go the "disallow
GUP" route than I think it needs to disallow GUP, period.  Like the whole "GUP
writes to file-back memory" issue[*], which I think you're alluding to, short-term
GUP is also problematic.  But unlike file-backed memory, for TDX and SNP (and I
think pKVM), a single rogue access has a high probability of being fatal to the
entire system.

Disallowing short-term should work, in theory, because the 
writes-to-fileback has different issues (the PIN is not the problem but 
the dirtying).

It's more related us not allowing long-term pins for FSDAX pages, 
because the lifetime of these pages is determined by the FS.

What we would do is

1) Unmap the large folio completely and make any refaults block.
-> No new pins can pop up

2) If the folio is pinned, busy-wait until all the short-term pins are
   gone.

3) Safely convert the relevant subpage from shared -> private

Not saying it's the best approach, but it should be doable.

I.e. except for blatant bugs, e.g. use-after-free, we need to be able to guarantee
with 100% accuracy that there are no outstanding mappings when converting a page
from shared=>private.  Crossing our fingers and hoping that short-term GUP will
have gone away isn't enough.

We do have the mapcount and the refcount that will be completely 
reliable for our cases.

folio_mapcount()==0 not mapped

folio_ref_count()==1 we hold the single folio reference. (-> no mapping, 
no GUP, no unexpected references)

(folio_maybe_dma_pinned() could be used as well, but things like 
vmsplice() and some O_DIRECT might still take references. 
folio_ref_count() is more reliable in that regard)

[*] https://lore.kernel.org/all/cover.1683235180.git.lstoakes@xxxxxxxxx

b) Expose the large folio as multiple 4k folios to the core-mm.

b) would look as follows: we allocate a gigantic page from the (hugetlb)
reserve into guest_memfd. Then, we break it down into individual 4k folios
by splitting/demoting the folio. We make sure that all 4k folios are
unmovable (raised refcount). We keep tracking internally that these 4k
folios comprise a single large gigantic page.

Core-mm can track for us now without any modifications per (previously
subpage,) now small folios GUP pins and page table mappings without
modifications.

Once we unmap the gigantic page from guest_memfd, we recronstruct the
gigantic page and hand it back to the reserve (only possible once all pins
are gone).

We can still map the whole thing into the KVM guest+iommu using a single
large unit, because guest_memfd knows the origin/relationship of these
pages. But we would only map individual pages into user page tables (unless
we use large VM_PFNMAP mappings, but then also pinning would not work, so
that's likely also not what we want).

Not being to map guest_memfd into userspace with 1GiB mappings should be ok, at
least for CoCo VMs.  If the guest shares an entire 1GiB chunk, e.g. for DMA or
whatever, then userspace can simply punch a hole in guest_memfd and allocate 1GiB
of memory from regular memory.  Even losing 2MiB mappings should be ok.

For non-CoCo VMs, I expect we'll want to be much more permissive, but I think
they'll be a complete non-issue because there is no shared vs. private to worry
about.  We can simply allow any and all userspace mappings for guest_memfd that is
attached to a "regular" VM, because a misbehaving userspace only loses whatever
hardening (or other benefits) was being provided by using guest_memfd.  I.e. the
kernel and system at-large isn't at risk.

The downside is that we won't benefit from vmemmap optimizations for large
folios from hugetlb, and have more tracking overhead when mapping individual
pages into user page tables.

Hmm, I suspect losing the vmemmap optimizations would be acceptable, especially
if we could defer the shattering until the guest actually tried to partially
convert a 1GiB/2MiB region, and restore the optimizations when the memory is
converted back.

We can only shatter/collapse if there are no unexpected folio 
references. So GUP would have to be handles as well ... so that is 
certainly problematic.

--
Cheers,

David / dhildenb