Yes, and I think we might have to revive that discussion, unfortunately.
I started thinking about this, but did not reach a conclusion. Sharing
my thoughts.
The minimum we might need to make use of guest_memfd (v1 or v2 ;) ) not
just for private memory should be:
(1) Have private + shared parts backed by guest_memfd. Either the same
fd, or an fd pair.
(2) Allow to mmap only the "shared" parts.
(3) Allow in-place conversion between "shared" and "private" parts.
These three were covered (modulo bugs) in the guest_memfd() RFC I'd
sent a while back:
https://lore.kernel.org/all/20240222161047.402609-1-tabba@xxxxxxxxxx/
I remember there was a catch to it (either around mmap or pinning
detection -- or around support for huge pages in the future; maybe these
count as BUGs :) ).
I should probably go back and revisit the whole thing, I was only CCed
on some part of it back then.
(4) Allow migration of the "shared" parts.
We would really like that too, if they allow us :)
A) Convert shared -> private?
* Must not be GUP-pinned
* Must not be mapped
* Must not reside on ZONE_MOVABLE/MIGRATE_CMA
* (must rule out any other problematic folio references that could
read/write memory, might be feasible for guest_memfd)
B) Convert private -> shared?
* Nothing to consider
C) Map something?
* Must not be private
A, B and C were covered (again, modulo bugs) in the RFC.
For ordinary (small) pages, that might be feasible.
(ZONE_MOVABLE/MIGRATE_CMA might be feasible, but maybe we could just not
support them initially)
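To make the rules in A) to C) concrete for small pages, here is a rough
user-space model in C. It is only a sketch: struct gmem_page, the
gmem_*() helpers and the chosen error codes are all made up for
illustration and are not guest_memfd or core-mm API. The pin and map
counters are plain stand-ins for what GUP and the page tables would
report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: one small page backed by guest_memfd. */
enum gmem_state { GMEM_SHARED, GMEM_PRIVATE };

struct gmem_page {
	enum gmem_state state;
	unsigned int pin_count;  /* stand-in for GUP pins */
	unsigned int map_count;  /* stand-in for user-space mappings */
	bool movable_or_cma;     /* page sits on ZONE_MOVABLE/MIGRATE_CMA */
};

/* A) shared -> private: refuse while pinned, mapped or on movable/CMA memory. */
int gmem_make_private(struct gmem_page *p)
{
	assert(p != NULL);
	if (p->state == GMEM_PRIVATE)
		return 0;
	if (p->pin_count || p->map_count || p->movable_or_cma)
		return -EBUSY;
	p->state = GMEM_PRIVATE;
	return 0;
}

/* B) private -> shared: nothing to check in this model. */
int gmem_make_shared(struct gmem_page *p)
{
	assert(p != NULL);
	p->state = GMEM_SHARED;
	return 0;
}

/* C) map: only shared pages may be mapped. */
int gmem_map(struct gmem_page *p)
{
	assert(p != NULL);
	if (p->state == GMEM_PRIVATE)
		return -EACCES;
	p->map_count++;
	return 0;
}

/* Drop one mapping again so that a later conversion can succeed. */
void gmem_unmap(struct gmem_page *p)
{
	assert(p != NULL && p->map_count > 0);
	p->map_count--;
}
```

In this model a mapped shared page cannot be made private until it is
unmapped, and a private page cannot be mapped until it is converted
back to shared.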
The real fun begins once we want to support huge pages/large folios and
can end up having a mixture of "private" and "shared" per huge page. But
really, that's what we want in the end I think.
I agree.
Unless we can teach the VM to not convert arbitrary physical memory
ranges on a 4k basis to a mixture of private/shared ... but I've been
told we don't want that. Hm.
There are two big problems with that that I can see:
1) References/GUP-pins are per folio
What if some shared part of the folio is pinned but another shared part
that we want to convert to private is not? Core-mm will not provide the
answer to that: the folio may be pinned, that's it. *Disallowing* at
least long-term GUP-pins might be an option.
Right.
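The per-folio problem can be illustrated with a tiny user-space model
in C. This is a sketch under assumptions: struct folio_model and its
helpers are hypothetical and only mimic the idea that a pin is counted
on the folio as a whole, without recording which subpage it was taken
for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a large folio with a single, folio-wide pin counter. */
struct folio_model {
	unsigned int nr_pages;
	unsigned int pin_count;
};

/* Pin one subpage. The index is checked but not remembered anywhere. */
void folio_pin(struct folio_model *f, unsigned int idx)
{
	assert(f != NULL && idx < f->nr_pages);
	f->pin_count++;
}

/* Unpin one subpage. Again only the folio-wide counter changes. */
void folio_unpin(struct folio_model *f, unsigned int idx)
{
	assert(f != NULL && idx < f->nr_pages && f->pin_count > 0);
	f->pin_count--;
}

/*
 * The best answer available for any subpage: "the folio may be pinned".
 * The result does not depend on idx at all.
 */
bool folio_subpage_maybe_pinned(const struct folio_model *f, unsigned int idx)
{
	assert(f != NULL && idx < f->nr_pages);
	return f->pin_count != 0;
}
```

With a 512-page folio, pinning subpage 0 makes the query for subpage
511 return true as well, so a conversion of subpage 511 would have to
be refused even though nobody pinned it.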
To get stuff into an IOMMU, maybe a per-fd interface could work, and
guest_memfd would track itself which parts are currently "handed out",
and with which "semantics" (shared vs. private).
[IOMMU + private parts might require that either way? Because, if we
disallow mmap, how should that ever work with an IOMMU otherwise?]
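One way to picture the per-fd tracking idea is the user-space model
below. It is only a sketch: struct gmem_fd_model, the gmem_hand_out(),
gmem_take_back() and gmem_convert() helpers, the fixed 512-page size
and the error codes are assumptions made for illustration, and nothing
like this exists in guest_memfd today. The point is that the fd itself
records, per page, whether a page is handed out and with which
semantics, so a conversion can be decided per range instead of per
folio.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define GMEM_NR_PAGES 512

enum gmem_sem { GMEM_SEM_SHARED, GMEM_SEM_PRIVATE };

/* Hypothetical per-fd bookkeeping, one entry per 4k page. */
struct gmem_fd_model {
	enum gmem_sem sem[GMEM_NR_PAGES];
	unsigned int handed_out[GMEM_NR_PAGES];
};

static int gmem_range_ok(size_t start, size_t nr)
{
	return nr <= GMEM_NR_PAGES && start <= GMEM_NR_PAGES - nr;
}

/* Hand a range out (e.g. to an IOMMU) with the requested semantics. */
int gmem_hand_out(struct gmem_fd_model *g, size_t start, size_t nr,
		  enum gmem_sem want)
{
	size_t i;

	assert(g != NULL);
	if (!gmem_range_ok(start, nr))
		return -EINVAL;
	for (i = start; i < start + nr; i++)
		if (g->sem[i] != want)
			return -EINVAL;
	for (i = start; i < start + nr; i++)
		g->handed_out[i]++;
	return 0;
}

/* Return a range that was handed out earlier. */
int gmem_take_back(struct gmem_fd_model *g, size_t start, size_t nr)
{
	size_t i;

	assert(g != NULL);
	if (!gmem_range_ok(start, nr))
		return -EINVAL;
	for (i = start; i < start + nr; i++)
		if (!g->handed_out[i])
			return -EINVAL;
	for (i = start; i < start + nr; i++)
		g->handed_out[i]--;
	return 0;
}

/* Convert a range; refuse only if a page in that very range is handed out. */
int gmem_convert(struct gmem_fd_model *g, size_t start, size_t nr,
		 enum gmem_sem to)
{
	size_t i;

	assert(g != NULL);
	if (!gmem_range_ok(start, nr))
		return -EINVAL;
	for (i = start; i < start + nr; i++)
		if (g->handed_out[i])
			return -EBUSY;
	for (i = start; i < start + nr; i++)
		g->sem[i] = to;
	return 0;
}
```

In this model, handing out pages 0-3 as shared does not block
converting pages 4-7 of the same 512-page region to private, while a
conversion of pages 2-5 is refused until pages 0-3 are taken back.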
Not sure if IOMMU + private makes that much sense really, but I think
I might not really understand what you mean by this.
A device might be able to access private memory. In the TDX world, this
would mean that a device "speaks" encrypted memory.
At the same time, a device might be able to access shared memory. Maybe
devices can do both?
What to do when converting between private and shared? I think it
depends on various factors (e.g., device capabilities).
[...]
I recall quite some details with memory renting or so on pKVM ... and I
have to refresh my memory on that.
I really would like to get to a place where we could investigate and
sort out all of these issues. It would be good to know though, what,
in principle (and not due to any technical limitations), we might be
allowed to do and expand guest_memfd() to do, and what out of
principle is off the table.
As Jason said, maybe we need a revised model that can handle
[...] private+shared properly.
--
Cheers,
David / dhildenb