On Wed, Mar 27, 2024, Will Deacon wrote:
> Hi again, David,
> 
> On Fri, Mar 22, 2024 at 06:52:14PM +0100, David Hildenbrand wrote:
> > On 19.03.24 15:31, Will Deacon wrote:
> > sorry for the late reply!
> 
> Bah, you and me both! Hold my beer ;-)
> 
> > > On Tue, Mar 19, 2024 at 11:26:05AM +0100, David Hildenbrand wrote:
> > > > On 19.03.24 01:10, Sean Christopherson wrote:
> > > > > On Mon, Mar 18, 2024, Vishal Annapurve wrote:
> > > > > > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > From the pKVM side, we're working on guest_memfd primarily to avoid
> > > diverging from what other CoCo solutions end up using, but if it gets
> > > de-featured (e.g. no huge pages, no GUP, no mmap) compared to what we
> > > do today with anonymous memory, then it's a really hard sell to switch
> > > over from what we have in production. We're also hoping that, over
> > > time, guest_memfd will become more closely integrated with the mm
> > > subsystem to enable things like hypervisor-assisted page migration,
> > > which we would love to have.
> > 
> > Reading Sean's reply, he has a different view on that. And I think
> > that's the main issue: there are too many different use cases and too
> > many different requirements that could turn guest_memfd into something
> > that maybe it really shouldn't be.
> 
> No argument there, and we're certainly not tied to any specific
> mechanism on the pKVM side. Maybe Sean can chime in, but we've
> definitely spoken about migration being a goal in the past, so I guess
> something changed since then on the guest_memfd side.

What's "hypervisor-assisted page migration"?  More specifically, what's the
mechanism that drives it?

I am not opposed to page migration itself; what I am opposed to is adding
deep integration with core MM to do some of the fancy/complex things that
lead to page migration.

Another thing I want to avoid is taking a hard dependency on "struct page",
so that we can have line of sight to eliminating "struct page" overhead for
guest_memfd, but that's definitely a more distant future concern.

> > This makes sense: shared memory is neither nasty nor special. You can
> > migrate it, swap it out, map it into page tables, GUP it, ... without
> > any issues.
> 
> Slight aside and not wanting to derail the discussion, but we have a few
> different types of sharing which we'll have to consider:
> 
> * Memory shared from the host to the guest. This remains owned by the
>   host and the normal mm stuff can be made to work with it.

This seems like it should be !guest_memfd, i.e. can't be converted to guest
private (without first unmapping it from the host, but at that point it's
completely different memory, for all intents and purposes).

> * Memory shared from the guest to the host. This remains owned by the
>   guest, so there's a pin on the pages and the normal mm stuff can't
>   work without co-operation from the guest (see next point).

Do you happen to have a list of exactly what you mean by "normal mm stuff"?

I am not at all opposed to supporting .mmap(), because long term I also want
to use guest_memfd for non-CoCo VMs.  But I want to be very conservative with
respect to what is allowed for guest_memfd.  E.g. host userspace can map
guest_memfd, and do operations that are directly related to its mapping, but
that's about it.

> * Memory relinquished from the guest to the host. This actually unmaps
>   the pages from the host and transfers ownership back to the host,
>   after which the pin is dropped and the normal mm stuff can work. We
>   use this to implement ballooning.
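The discard side of that last flow should work with guest_memfd as it exists
today, since discarding is just hole punching on the fd.  Completely untested
sketch of the userspace side (assumes 6.8+ uapi headers; vm_fd, size, offset
and len are placeholders, error handling omitted):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Create the guest_memfd that backs the guest's private memory. */
        static int create_gmem(int vm_fd, __u64 size)
        {
                struct kvm_create_guest_memfd gmem = {
                        .size = size,
                };

                return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
        }

        /*
         * Discard a range the guest has relinquished.  Truncation works at
         * page granularity, so the host can reclaim memory piecemeal.
         */
        static int discard_range(int gmem_fd, __u64 offset, __u64 len)
        {
                return fallocate(gmem_fd,
                                 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                                 offset, len);
        }
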
> 
> I suppose the main thing is that the architecture backend can deal with
> these states, so the core code shouldn't really care as long as it's
> aware that shared memory may be pinned.
> 
> > So if I would describe some key characteristics of guest_memfd as of
> > today, it would probably be:
> > 
> > 1) Memory is unmovable and unswappable. Right from the beginning, it is
> >    allocated as unmovable (e.g., not placed on ZONE_MOVABLE, CMA, ...).
> > 2) Memory is inaccessible. It cannot be read from user space or the
> >    kernel, and it cannot be GUP'ed ... only some mechanisms (e.g.,
> >    hibernation, /proc/kcore) might end up touching it "by accident",
> >    and we usually can handle these cases.
> > 3) Memory can be discarded at page granularity. There should be no
> >    cases where you cannot discard memory; otherwise, you would
> >    over-allocate for private pages that have been replaced by shared
> >    pages.
> > 4) Page tables are not required (well, it's a memfd), and the fd could
> >    in theory be passed to other processes.

More broadly, no VMAs are required.  The lack of stage-1 page tables is a
nice-to-have; the lack of VMAs means that guest_memfd isn't playing second
fiddle, e.g. it's not subject to VMA protections, isn't restricted to host
mapping size, etc.

> > Having "ordinary shared" memory in there implies that 1) and 2) will
> > have to be adjusted for them, which kind-of turns it "partially" into
> > ordinary shmem again.
> 
> Yes, and we'd also need a way to establish hugepages (where possible)
> even for the *private* memory so as to reduce the depth of the guest's
> stage-2 walk.

Yeah, hugepage support for guest_memfd is very much a WIP.  Getting
_something_ is easy, getting the right thing is much harder.
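FWIW, the binding side of guest_memfd is already fd-based, which is what
keeps the stage-2 mapping size from being tied to host VMAs in the first
place.  Roughly, with the 6.8 uapi (again untested; vm_fd, gmem_fd, gpa,
size and shared_hva are placeholders):

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /*
         * Bind a guest_memfd into a memslot.  Shared pages still come from
         * a normal host mapping (userspace_addr); private pages come
         * straight from the fd, with no VMA anywhere in the picture.
         */
        static int bind_gmem_slot(int vm_fd, int gmem_fd, __u64 gpa,
                                  __u64 size, void *shared_hva)
        {
                struct kvm_userspace_memory_region2 region = {
                        .slot               = 0,
                        .flags              = KVM_MEM_GUEST_MEMFD,
                        .guest_phys_addr    = gpa,
                        .memory_size        = size,
                        .userspace_addr     = (__u64)(unsigned long)shared_hva,
                        .guest_memfd        = gmem_fd,
                        .guest_memfd_offset = 0,
                };

                return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
        }

The hard part is everything that snippet doesn't show: how the fd decides to
back a range with a hugepage, and what happens to that hugepage when private
pages get converted to shared.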