On 27.03.24 20:34, Will Deacon wrote:
Hi again, David,
On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
On 19.03.24 15:31, Will Deacon wrote:
sorry for the late reply!
Bah, you and me both!
This time I'm faster! :)
On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
On 19.03.24 01:10, Sean Christopherson wrote:
On Mon, Mar 18, 2024, Vishal Annapurve wrote:
On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
From the pKVM side, we're working on guest_memfd primarily to avoid
diverging from what other CoCo solutions end up using, but if it gets
de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we do
today with anonymous memory, then it's a really hard sell to switch over
from what we have in production. We're also hoping that, over time,
guest_memfd will become more closely integrated with the mm subsystem to
enable things like hypervisor-assisted page migration, which we would
love to have.
Reading Sean's reply, he has a different view on that. And I think that's
the main issue: there are too many different use cases and too many
different requirements that could turn guest_memfd into something that maybe
it really shouldn't be.
No argument there, and we're certainly not tied to any specific
mechanism on the pKVM side. Maybe Sean can chime in, but we've
definitely spoken about migration being a goal in the past, so I guess
something changed since then on the guest_memfd side.
Regardless, from our point of view, we just need to make sure that
whatever we settle on for pKVM does the things we need it to do (or can
at least be extended to do them) and we're happy to implement that in
whatever way works best for upstream, guest_memfd or otherwise.
We're happy to pursue alternative approaches using anonymous memory if
you'd prefer to keep guest_memfd limited in functionality (e.g.
preventing GUP of private pages by extending mapping_flags as per [1]),
but we're equally willing to contribute to guest_memfd if extensions are
welcome.
What do you prefer?
Let me summarize the history:
First off, thanks for piecing together the archaeology...
AMD had its thing running and it worked for them (but I recall it was hacky
:) ).
TDX made it possible to crash the machine when accessing secure memory from
user space (MCE).
So secure memory must not be mapped into user space -- no page tables.
Prototypes with anonymous memory existed (and I didn't hate them, although
hacky), but one of the other selling points of guest_memfd was that we could
create VMs that wouldn't need any page tables at all, which I found
interesting.
Are the prototypes you refer to here based on the old stuff from Kirill?
Yes.
We followed that work at the time, thinking we were going to be using
that before guest_memfd came along, so we've sadly been collecting
out-of-tree patches for a little while :/
:/
There was a bit more to that (easier conversion, avoiding GUP, specifying on
allocation that the memory was unmovable ...), but I'll get to that later.
The design principle was: nasty private memory (unmovable, unswappable,
inaccessible, un-GUPable) is allocated from guest_memfd; ordinary "shared"
memory is allocated from an ordinary memfd.
This makes sense: shared memory is neither nasty nor special. You can
migrate it, swap it out, map it into page tables, GUP it, ... without any
issues.
Slight aside and not wanting to derail the discussion, but we have a few
different types of sharing which we'll have to consider:
Thanks for sharing!
* Memory shared from the host to the guest. This remains owned by the
host and the normal mm stuff can be made to work with it.
Okay, host and guest can access it. We can just migrate memory around,
swap it out ... like ordinary guest memory today.
* Memory shared from the guest to the host. This remains owned by the
guest, so there's a pin on the pages and the normal mm stuff can't
work without co-operation from the guest (see next point).
Okay, host and guest can access it, but we cannot migrate memory around
or swap it out ... like ordinary guest memory today that is longterm pinned.
* Memory relinquished from the guest to the host. This actually unmaps
the pages from the host and transfers ownership back to the host,
after which the pin is dropped and the normal mm stuff can work. We
use this to implement ballooning.
Okay, so this is essentially just a state transition between the two above.
I suppose the main thing is that the architecture backend can deal with
these states, so the core code shouldn't really care as long as it's
aware that shared memory may be pinned.
So IIUC, the states are:
(1) Private: inaccessible by the host, accessible by the guest, "owned by
the guest"
(2) Host Shared: accessible by the host + guest, "owned by the host"
(3) Guest Shared: accessible by the host + guest, "owned by the guest"
Memory ballooning is simply transitioning from (3) to (2), and then
discarding the memory.
Any state I am missing?
Which transitions are possible?
(1) <-> (2) ? Not sure if the direct transition is possible.
(2) <-> (3) ? IIUC yes.
(1) <-> (3) ? IIUC yes.
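Expressed as a tiny sketch (the names are mine and purely illustrative,
not an existing interface), that state model would be:

#include <stdbool.h>

enum guest_mem_state {
	MEM_PRIVATE,		/* (1) guest-owned, inaccessible by the host */
	MEM_HOST_SHARED,	/* (2) host-owned, host + guest can access */
	MEM_GUEST_SHARED,	/* (3) guest-owned + pinned, host + guest can access */
};

/* Direct (1) <-> (2) is left out, as it's unclear whether it's possible. */
static bool transition_allowed(enum guest_mem_state from,
			       enum guest_mem_state to)
{
	switch (from) {
	case MEM_PRIVATE:	/* (1) -> (3) */
		return to == MEM_GUEST_SHARED;
	case MEM_GUEST_SHARED:	/* (3) -> (1); (3) -> (2) on relinquish */
		return to == MEM_PRIVATE || to == MEM_HOST_SHARED;
	case MEM_HOST_SHARED:	/* (2) -> (3) */
		return to == MEM_GUEST_SHARED;
	}
	return false;
}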
There is ongoing work on longterm-pinning memory from a memfd/shmem. So,
thinking in terms of my vague "guest_memfd fd + memfd fd" pair idea, that
approach could look like the following:
(1) guest_memfd (could be "with longterm pin")
(2) memfd
(3) memfd with a longterm pin
But again, just some possible idea to make it work with guest_memfd.
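For concreteness, a minimal user-space sketch of that pairing, assuming
the KVM_CREATE_GUEST_MEMFD ioctl as merged (whether pKVM would reuse that
exact uapi is of course an open question):

#define _GNU_SOURCE
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Create the (1) + (2)/(3) fd pair for one memslot; error handling omitted. */
static int create_fd_pair(int vm_fd, uint64_t size,
			  int *private_fd, int *shared_fd)
{
	struct kvm_create_guest_memfd gmem = { .size = size };

	*private_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem); /* (1) */
	*shared_fd = memfd_create("guest-shared", MFD_CLOEXEC);    /* (2) */
	if (*private_fd < 0 || *shared_fd < 0)
		return -1;
	/* For (3), the host would additionally take a longterm pin. */
	return ftruncate(*shared_fd, size);
}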
So if I were to describe some key characteristics of guest_memfd as of
today, they would probably be:
1) Memory is unmovable and unswappable. Right from the beginning, it is
allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
2) Memory is inaccessible. It cannot be read from user space or the
kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
hibernation, /proc/kcore) might end up touching it "by accident", and
we can usually handle these cases.
3) Memory can be discarded at page granularity. There should be no case
where you cannot discard memory; otherwise, you end up over-allocating
for private pages that have been replaced by shared pages (see the
sketch below this list).
4) Page tables are not required (well, it's a memfd), and the fd could
in theory be passed to other processes.
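For 3), the merged guest_memfd implements exactly that via hole punching;
roughly (error handling omitted, offset/len must be page-aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>

/*
 * Discard a range of private memory, e.g., after it has been replaced
 * by shared pages, so we don't consume memory twice.
 */
static int discard_private(int gmem_fd, uint64_t offset, uint64_t len)
{
	return fallocate(gmem_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}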
Having "ordinary shared" memory in there implies that 1) and 2) will have to
be adjusted for them, which kind-of turns it "partially" into ordinary shmem
again.
Yes, and we'd also need a way to establish hugepages (where possible)
even for the *private* memory so as to reduce the depth of the guest's
stage-2 walk.
Understood, and as discussed, that's a bit more "hairy".
Going back to the beginning: with pKVM, we likely want the following:
1) Convert pages private<->shared in-place
2) Stop user space + kernel from accessing private memory in process
context. Likely for pKVM we would only crash the process, which
would be acceptable.
3) Prevent GUP to private memory; otherwise, we could crash the kernel
(see the sketch below).
4) Prevent private pages from swapout+migration until supported.
I suspect your current solution with anonymous memory gets all but 3) sorted
out, correct?
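For 3), a rough sketch of what the mapping_flags idea from [1] could look
like when checked from the GUP paths (AS_NOGUP is a made-up flag name,
nothing that exists upstream, and how it would apply to anonymous folios
is exactly the open question):

#include <linux/pagemap.h>

static inline bool folio_gup_allowed(struct folio *folio)
{
	struct address_space *mapping = folio_mapping(folio);

	/* Anonymous/unmapped folios keep today's behaviour. */
	if (!mapping)
		return true;
	return !test_bit(AS_NOGUP, &mapping->flags);
}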
I agree on all of these and, yes, (3) is the problem for us. We've also
been thinking a bit about CoW recently and I suspect the use of
vm_normal_page() in do_wp_page() could lead to issues similar to those
we hit with GUP. There are various ways to approach that, but I'm not
sure what's best.
Would COW be required or is that just the nasty side-effect of trying to
use anonymous memory?
I'm curious, may there be a requirement in the future that shared memory
could be mapped into other processes? (thinking vhost-user and such things).
It's not impossible. We use crosvm as our VMM, and that has a
multi-process sandbox mode which I think relies on just that...
Okay, so basing the design on anonymous memory might not be the best
choice ... :/
Cheers,
Will
(btw: I'm getting some time away from the computer over Easter, so I'll be
a little slow on email again. Nothing personal!).
Sure, no worries! Enjoy!
--
Cheers,
David / dhildenb