Hey David,

I'll try to pick up the baton while Will is away :-)

On Thursday 28 Mar 2024 at 10:06:52 (+0100), David Hildenbrand wrote:
> On 27.03.24 20:34, Will Deacon wrote:
> > I suppose the main thing is that the architecture backend can deal with
> > these states, so the core code shouldn't really care as long as it's
> > aware that shared memory may be pinned.
>
> So IIUC, the states are:
>
> (1) Private: inaccessible by the host, accessible by the guest, "owned by
> the guest"
>
> (2) Host Shared: accessible by the host + guest, "owned by the host"
>
> (3) Guest Shared: accessible by the host, "owned by the guest"

Yup.

> Memory ballooning is simply transitioning from (3) to (2), and then
> discarding the memory.

Well, not quite actually, see below.

> Any state I am missing?

So there is probably state (0) which is 'owned only by the host'. It's a
bit obvious, but I'll make it explicit because it has its importance for
the rest of the discussion.

And while at it, there are other cases (memory shared/owned with/by the
hypervisor and/or TrustZone) but they're somewhat irrelevant to this
discussion. These pages are usually backed by kernel allocations, so
much less problematic to deal with. So let's ignore those.

> Which transitions are possible?

Basically a page must be in the 'exclusively owned' state for an owner
to initiate a share or donation. So e.g. a shared page must be unshared
before it can be donated to someone else (that is true regardless of the
owner, host, guest, hypervisor, ...). That significantly simplifies the
state tracking in pKVM.

> (1) <-> (2) ? Not sure if the direct transition is possible.

Yep, not possible.

> (2) <-> (3) ? IIUC yes.

Actually it's not directly possible as is. The ballooning procedure is
essentially a (1) -> (0) transition. (We also tolerate (3) -> (0) in a
single hypercall when doing ballooning, but it's technically just a
(3) -> (1) -> (0) sequence that has been micro-optimized).
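
To make the transition rules above concrete, here is a rough sketch of
them as a table. This is only an illustration of the states as numbered
in this thread: the enum, transition_allowed() and balloon_page() are
hypothetical names, not pKVM's actual code, and the table only encodes
the transitions discussed here (share, unshare, donate), nothing about
the hypervisor or TrustZone cases.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; the numbers match the states used in this thread. */
enum page_state {
	HOST_OWNED,	/* (0) owned only by the host */
	GUEST_PRIVATE,	/* (1) owned by the guest, inaccessible by the host */
	HOST_SHARED,	/* (2) owned by the host, shared with the guest */
	GUEST_SHARED,	/* (3) owned by the guest, shared with the host */
	NR_PAGE_STATES,
};

/*
 * A share or donation can only start from an exclusively owned state,
 * i.e. (0) or (1). Shared states can only be unshared back to their owner.
 */
static const bool allowed[NR_PAGE_STATES][NR_PAGE_STATES] = {
	[HOST_OWNED]	= { [GUEST_PRIVATE] = true,	/* donate to guest */
			    [HOST_SHARED] = true },	/* share with guest */
	[GUEST_PRIVATE]	= { [GUEST_SHARED] = true,	/* share with host */
			    [HOST_OWNED] = true },	/* donate to host */
	[HOST_SHARED]	= { [HOST_OWNED] = true },	/* host unshares */
	[GUEST_SHARED]	= { [GUEST_PRIVATE] = true },	/* guest unshares */
};

bool transition_allowed(enum page_state from, enum page_state to)
{
	assert(from < NR_PAGE_STATES && to < NR_PAGE_STATES);
	return allowed[from][to];
}

/*
 * Ballooning hands a guest-owned page back to the host. Starting from (3),
 * this is modelled as (3) -> (1) -> (0), the sequence that the single
 * hypercall mentioned above short-circuits.
 */
bool balloon_page(enum page_state *state)
{
	if (*state == GUEST_SHARED)
		*state = GUEST_PRIVATE;	/* unshare first */
	if (*state != GUEST_PRIVATE)
		return false;		/* not guest-owned: nothing to balloon */
	*state = HOST_OWNED;
	return true;
}
```

With this table, (1) <-> (2) and (2) <-> (3) are both rejected by
transition_allowed(), and balloon_page() refuses a page in state (2)
since the host already owns it.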

Note that state (2) is actually never used for protected VMs. It's
mainly used to implement standard non-protected VMs. The biggest
difference in pKVM between protected and non-protected VMs is basically
that in the former case, in the fault path KVM does a (0) -> (1)
transition, but in the latter it's (0) -> (2). That implies that in the
unprotected case, the host remains the page owner and is allowed to
decide to unshare arbitrary pages, to restrict the guest permissions for
the shared pages etc, which paves the way for implementing migration,
swap, ... relatively easily.

> (1) <-> (3) ? IIUC yes.

Yep.

<snip>

> > I agree on all of these and, yes, (3) is the problem for us. We've also
> > been thinking a bit about CoW recently and I suspect the use of
> > vm_normal_page() in do_wp_page() could lead to issues similar to those
> > we hit with GUP. There are various ways to approach that, but I'm not
> > sure what's best.
>
> Would COW be required or is that just the nasty side-effect of trying to use
> anonymous memory?

That'd qualify as an undesirable side effect I think.

> >
> > > I'm curious, may there be a requirement in the future that shared memory
> > > could be mapped into other processes? (thinking vhost-user and such things).
> >
> > It's not impossible. We use crosvm as our VMM, and that has a
> > multi-process sandbox mode which I think relies on just that...
> >
>
> Okay, so basing the design on anonymous memory might not be the best choice
> ... :/

So, while we're at this stage, let me throw another idea at the wall to
see if it sticks :-)

One observation is that a standard memfd would work relatively well for
pKVM if we had a way to enforce that all mappings to it are MAP_SHARED.
KVM would still need to take an 'exclusive GUP' from the fault path
(which may fail in case of a pre-existing GUP, but that's fine), but
then CoW and friends largely become a non-issue by construction I think.
Is there any way we could enforce that cleanly?
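
For reference, a small userspace sketch of why !MAP_SHARED mappings are
the troublesome case. It is only an illustration, assuming Linux and a
glibc that provides memfd_create(); write_via_mapping() and
check_mapping_semantics() are made-up helper names. A write through a
MAP_PRIVATE mapping triggers CoW and lands in a private copy, so the
mapping no longer refers to the memfd's page, whereas a write through a
MAP_SHARED mapping hits the memfd's page itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write 0xaa through a mapping of a fresh one-page memfd created with
 * the given mmap() flags, then read the same byte back through the fd.
 * Returns the byte seen through the fd, or -1 on error.
 */
int write_via_mapping(int map_flags)
{
	const size_t len = (size_t)sysconf(_SC_PAGESIZE);
	unsigned char seen = 0;
	unsigned char *p;
	int ret = -1;
	int fd;

	fd = memfd_create("demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
	if (p == MAP_FAILED)
		goto out_close;

	p[0] = 0xaa;	/* CoW happens here if the mapping is private */

	if (pread(fd, &seen, 1, 0) == 1)
		ret = seen;

	munmap(p, len);
out_close:
	close(fd);
	return ret;
}

/* Usage: the shared write reaches the memfd, the private one does not. */
void check_mapping_semantics(void)
{
	assert(write_via_mapping(MAP_SHARED) == 0xaa);
	assert(write_via_mapping(MAP_PRIVATE) == 0);
}
```

If every mapping of the memfd were forced to be MAP_SHARED, the second
case simply could not occur.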

Perhaps introducing a sort of 'mmap notifier' would do the trick? By
that I mean something a bit similar to an MMU notifier offered by memfd
that KVM could register against whenever the memfd is attached to a
protected VM memslot.

One of the nice things here is that we could retain an entire mapping of
the whole of guest memory in userspace, and conversions wouldn't require
any additional effort from userspace. A bad thing is that a process that
is being passed such a memfd may not expect the new semantics and the
inability to map !MAP_SHARED. But I guess a process that receives a
handle to private memory must be enlightened regardless of the type of
fd, so maybe it's not so bad.

Thoughts?

Thanks,
Quentin