On Tue, 2024-07-30 at 11:15 +0100, David Hildenbrand wrote:
>>> Hi,
>>>
>>> sorry for the late reply. Yes, you could have joined .... too late.
>>
>> No worries, I did end up joining to listen in to y'all's discussion
>> anyway :)
>
> Sorry for the late reply :(

No worries :)

>>
>>> There will be a summary posted soon. So far the agreement is that we're
>>> planning on allowing shared memory as part of guest_memfd, and will
>>> allow that to get mapped and pinned. Private memory is not going to get
>>> mapped and pinned.
>>>
>>> If we have to disallow pinning of shared memory on top for some use
>>> cases (i.e., no directmap), I assume that could be added.
>>>
>>>>
>>>>> Note that just from staring at this commit, I don't understand the
>>>>> motivation *why* we would want to do that.
>>>>
>>>> Fair - I admittedly didn't get into that as much as I probably should
>>>> have. In our usecase, we do not have anything that pKVM would (I think)
>>>> call "guest-private" memory. I think our memory can be better described
>>>> as guest-owned, but always shared with the VMM (e.g. userspace), and
>>>> ideally never shared with the host kernel. This model lets us make a
>>>> lot of simplifying assumptions: things like I/O can be handled in
>>>> userspace without the guest explicitly sharing I/O buffers (which is
>>>> not exactly what we would want long-term anyway, as sharing in the
>>>> guest_memfd context means sharing with the host kernel), we can easily
>>>> do VM snapshotting without needing things like TDX's TDH.EXPORT.MEM
>>>> APIs, etc.
>>>
>>> Okay, so essentially you would want to use guest_memfd to only contain
>>> shared memory and disallow any pinning, like for secretmem.
>>
>> Yeah, this is pretty much what I thought we wanted before listening in
>> on Wednesday.
>>
>> I've actually been thinking about this some more since then, though.
>> With hugepages, if the VM is backed by, say, 2M pages, our on-demand
>> direct map insertion approach runs into the same problem that CoCo VMs
>> have when they're backed by hugepages: how to deal with the guest only
>> sharing a 4K range in a hugepage? If we want to restore the direct map
>> for e.g. the page containing kvm-clock data, then we can't simply go
>> ahead and restore the direct map for the entire 2M page, because there
>> very well might be stuff in the other 511 small guest pages that we
>> really do not want in the direct map. And we can't even take the
>
> Right, you'd only want to restore the direct map for a fragment. Or
> dynamically map that fragment using kmap where required (as raised by
> Vlastimil).

Can the kmap approach work if the memory is supposed to be GUP-able?

>> approach of letting the guest deal with the problem, because here
>> "sharing" is driven by the host, not the guest, so the guest cannot
>> possibly know that it should maybe avoid putting stuff it doesn't want
>> shared into those remaining 511 pages! To me that sounds a lot like the
>> whole "breaking down huge folios to allow GUP to only some parts of
>> them" thing mentioned on Wednesday.
>
> Yes. While it would be one logical huge page, it would be exposed to the
> remainder of the kernel as 512 individual pages.
>
>>
>> Now, if we instead treat "guest memory without direct map entries" as
>> "private", and "guest memory with direct map entries" as "shared", then
>> the above will be solved by whatever mechanism allows gupping/mapping of
>> only the "shared" parts of huge folios, IIUC. The fact that GUP is then
>> also allowed for the "shared" parts is not actually a problem for us -
>> we went down the route of disabling GUP altogether here because, based
>> on [1], it sounded like GUP for anything gmem-related would never
>> happen.
>
> Right. Might there also be a case for removing the directmap for shared
> memory, or is that not really a requirement so far?

No, not really - we would only mark as "shared" memory that _needs_ to be
in the direct map for functional reasons (e.g. MMIO instruction emulation,
etc.).

>> But after something is re-inserted into the direct map, we don't very
>> much care whether it can be GUP-ed or not. In fact, allowing GUP for the
>> shared parts probably makes some things easier for us, as we can then do
>> I/O without bounce buffers by just in-place converting I/O buffers to
>> shared, and then treating that shared slice of guest_memfd the same way
>> we treat traditional guest memory today.
>
> Yes.
>
>> In a very far-off future, we'd
>> like to be able to do I/O without ever reinserting pages into the direct
>> map, but I don't think adopting this private/shared model for gmem would
>> block us from doing that?
>
> How would that I/O get triggered? GUP would require the directmap.

I was hoping that this "phyr" thing Matthew has been talking about [1]
would somehow allow doing I/O without direct map entries/GUP, but maybe
I am misunderstanding something.

>>
>> Although all of this does hinge on us being able to do the in-place
>> shared/private conversion without any guest involvement. Do you envision
>> that to be possible?
>
> Who would trigger the conversion and how? I don't see a reason why --
> for your use case -- user space shouldn't be able to trigger conversion
> private <-> shared. At least nothing fundamental comes to mind that
> would prohibit that.

Either KVM itself would trigger the conversions whenever it wants to
access gmem (e.g. each place in this series that currently does a
set_direct_map_{invalid,default} would instead do a shared/private
conversion), or userspace would do it via some syscall/ioctl (the one
place I can think of right now is I/O, where the VMM receives a virtio
buffer from the guest and converts it from private to shared in-place -
although I guess 2 syscalls for each I/O operation aren't great
perf-wise, so maybe swiotlb still wins out here?).
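
To make the KVM-triggered variant a bit more concrete, below is a very
rough sketch of what I have in mind (this is not code from this series;
the kvm_gmem_make_page_shared()/kvm_gmem_make_page_private() names are
made up, and I'm just assuming the existing single-page
set_direct_map_*_noflush() helpers plus an explicit TLB flush):

#include <linux/mm.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/*
 * Sketch only: "convert to shared" == give the page a direct map entry
 * again, so that the host kernel (and GUP) can access it. No TLB flush
 * needed here, since the entry was non-present (and flushed) before.
 */
static int kvm_gmem_make_page_shared(struct page *page)
{
        return set_direct_map_default_noflush(page);
}

/*
 * Sketch only: "convert back to private" == remove the page's direct map
 * entry again once KVM is done accessing it, and flush stale TLB entries.
 */
static int kvm_gmem_make_page_private(struct page *page)
{
        unsigned long addr = (unsigned long)page_address(page);
        int r = set_direct_map_invalid_noflush(page);

        if (r)
                return r;

        flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
        return 0;
}

KVM would call the first one before e.g. writing kvm-clock data into the
page, and the second one afterwards; the userspace-triggered variant
would essentially be the same operation behind some ioctl on the
guest_memfd, operating on a range instead of a single page. (Hugepage
splitting, batching of TLB flushes, refcounting of concurrent
conversions, etc. are all ignored here, obviously.)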

I actually see that Fuad just posted an RFC series that implements the
basic shared/private handling [2], so I will probably also comment on
this over there after I've had a closer look :)

> --
> Cheers,
>
> David / dhildenb

Best,
Patrick

[1]: https://lore.kernel.org/netdev/Yd0IeK5s%2FE0fuWqn@xxxxxxxxxxxxxxxxxxxx/T/
[2]: https://lore.kernel.org/kvm/20240801090117.3841080-1-tabba@xxxxxxxxxx/T/#t