+Christian and Hugh

On Wed, Apr 19, 2023, David Hildenbrand wrote:
> On 19.04.23 02:47, Sean Christopherson wrote:
> > An alternative that we haven't considered since the very early days is
> > making the uAPI a KVM ioctl() instead of a memfd() flag or dedicated
> > syscall.  Looking at the code for "pure shim" implementation[3], that's
> > actually not that crazy of an idea.
>
> Yes.
>
> > I don't know that I love the idea of burying this in KVM, but there are
> > benefits to coupling restrictedmem to KVM (aside from getting out from
> > behind this bikeshed that I created).
>
> Yes, it's all better than jumping through hoops to come up with a bad name
> like "restrictedmem".
>
> > The big benefit is that the layer of indirection goes away.  That
> > simplifies things like enhancing restrictedmem to allow host userspace
> > access for debug purposes, batching TLB flushes if a PUNCH_HOLE spans
> > multiple memslots, enforcing exclusive access, likely the whole "share
> > with a device" story if/when we get there, etc.
> >
> > The obvious downsides are that (a) maintenance falls under the KVM
> > umbrella, but that's likely to be true in practice regardless of where
> > the code lands, and
>
> Yes.
>
> > (b) if another use case comes along, e.g. the Gunyah hypervisor[4][5],
> > we risk someone reinventing a similar solution.
>
> I agree. But if it's as simple as providing an ioctl for that hypervisor
> that simply wires up the existing implementation, it's not too bad.
>
> > If we can get Gunyah on board and they don't need substantial changes to
> > the restrictedmem implementation, then I'm all for continuing on the path
> > we're on.  But if Gunyah wants to do their own thing, and the lightweight
> > shim approach is viable, then it's awfully tempting to put this all
> > behind a KVM ioctl().
>
> Right. Or we still succeed in finding a name that's not as bad as what we
> had so far.

Okie dokie.

So as not to bury the lede: I think we should provide a KVM ioctl() and have
KVM provide its own low-level implementation, i.e. not wrap shmem.  As much as
I don't want to have KVM serving up memory to userspace, by trying to keep KVM
out of the memory management business, we're actually making things *more*
complex and harder to maintain (and merge).

Hugh said something to this effect quite early on[1], it just unfortunately
took me a long time to understand the benefits.

: QEMU gets fd from opening /dev/kvm_something, uses ioctls (or perhaps
: the fallocate syscall interface itself) to allocate and free the memory,
: ioctl for initializing some of it too.  KVM in control of whether that
: fd can be read or written or mmap'ed or whatever, no need to prevent it
: in shmem.c, no need for flags, seals, notifications to and fro because
: KVM is already in control and knows the history.  If shmem actually has
: value, call into it underneath - somewhat like SysV SHM, and /dev/zero
: mmap, and i915/gem make use of it underneath.  If shmem has nothing to
: add, just allocate and free kernel memory directly, recorded in your
: own xarray.

Christian also suggested that we stop trying to be lazy and create a proper
API[2].

: I find this hard to navigate tbh and it feels like taking a shortcut to
: avoid building a proper api. If you only care about a specific set of
: operations specific to memfd restricte that needs to be available to
: in-kernel consumers, I wonder if you shouldn't just go one step further
: then your proposal below and build a dedicated minimal ops api.  Idk,
: sketching like a madman on a drawning board here with no claim to
: feasibility from a mm perspective whatsoever
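Just to make Hugh's "allocate and free kernel memory directly, recorded in
your own xarray" option a bit more concrete, a bare-bones sketch of what that
could look like is below.  To be clear, this is purely illustrative scribbling
on my end, not code from any posted series: the names are made up, and locking
and error handling are mostly omitted.

  #include <linux/xarray.h>
  #include <linux/gfp.h>
  #include <linux/err.h>
  #include <linux/mm.h>

  /* Hypothetical per-fd state, names purely illustrative. */
  struct gmem_file {
  	struct xarray pages;		/* file index => struct page * */
  };

  /*
   * Get the zeroed page backing @index, allocating it on first use.
   * xa_init() at fd creation and tearing down the xarray on release are
   * omitted for brevity.
   */
  static struct page *gmem_get_page(struct gmem_file *f, pgoff_t index)
  {
  	struct page *page = xa_load(&f->pages, index);

  	if (page)
  		return page;

  	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
  	if (!page)
  		return ERR_PTR(-ENOMEM);

  	/*
  	 * Racy as written; a real implementation would use xa_cmpxchg()
  	 * or hold a lock across the lookup+store.
  	 */
  	if (xa_err(xa_store(&f->pages, index, page, GFP_KERNEL))) {
  		__free_page(page);
  		return ERR_PTR(-ENOMEM);
  	}
  	return page;
  }

  /* PUNCH_HOLE/release path: free the page backing @index, if any. */
  static void gmem_free_page(struct gmem_file *f, pgoff_t index)
  {
  	struct page *page = xa_erase(&f->pages, index);

  	if (page)
  		__free_page(page);
  }

A dedicated minimal ops API along the lines Christian describes would then be
little more than a handful of functions like the above, exposed to KVM (or
Gunyah) as the only way in.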
The key point from Hugh is that, except for a few minor things that are
trivially easy to replicate, the things that make shmem "shmem" don't provide
any value for the KVM use case:

- We have no plans to support swap, and migration support is dubious at best.
  Swap in particular brings in a lot of complexity for no benefit (to us).
  That complexity doesn't mean depending on shmem is inherently bad, but it
  does mean rolling our own implementation is highly unlikely to result in
  reinventing shmem's wheel.

- Sharing a backing file between arbitrary processes is _unwanted_ for the
  initial use cases.  There may come a time when mutually trusted VMs can
  share "private" data, but (a) that's a distant future problem and (b) it'll
  likely require even more KVM control over the memory.

- All of the interfaces for read/write/mmap/etc. are dead weight from our
  perspective.  Even worse, we have to actively avoid those interfaces.  That
  can kinda sorta be done without polluting shmem code by using a shim, but
  that has problems of its own (see below).

And Christian pointed out several flaws with wrapping shmem:

- Implementing a partial overlay filesystem leads to inconsistencies because
  only some of the ops are changed, e.g. poking at the inode_operations or
  super_operations will show shmem stuff, whereas address_space_operations
  and file_operations will show restrictedmem stuff.

- Usurping just f_ops and a_ops without creating a full blown filesystem
  avoids the partial overlay issues, but partially overriding ops isn't
  really supported (because it's weird and brittle), e.g. blindly forwarding
  a fallocate() after restrictedmem does its thing "works", but only because
  we very carefully and deliberately designed restrictedmem on top of shmem.

On top of the points raised by Hugh and Christian, wrapping shmem isn't really
any simpler, just different.  E.g. there's a very subtle bug in the shim
variant: by passing SGP_WRITE, shmem skips zeroing the page because
restrictedmem is telling shmem that it (restrictedmem) will immediately write
the page.  For TDX and SNP, that's a "feature" because in those cases trusted
firmware will zero (or init) memory when it's assigned to the guest, but it's
a nasty flaw for other use cases.

I'm not saying that we'll magically avoid such bugs by avoiding shmem, just
pointing out that using shmem requires understanding exactly how shmem works,
i.e. using shmem isn't necessarily any easier than building directly on the
filemap and/or folio APIs.  And I gotta imagine it will be a similar story
if/when we want to add hugetlbfs support.

Building on filemap/folio will definitely have its own challenges, but after
prototyping an implementation I can confidently say that none of the problems
will be insurmountable.  KVM has _way_ more complex code in its memslot and
MMU implementations.

And another benefit of building on the filemap and/or folio APIs is that,
because we would be reinventing the wheel to some extent, when we do
inevitably run into problems, it will be easier to get help solving those
problems because (a) we won't be doing weird things no one wants to deal with
and (b) the problems will likely be things others have already dealt with.
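For reference, the POC's helper for grabbing the folio that backs a given
index looks roughly like the below.  This is paraphrased rather than
copy+pasted, so treat the details, e.g. exactly how filemap_grab_folio()
reports failure on a given kernel, as approximate.  The point being that the
"who zeroes the page, and when" question is answered explicitly instead of
falling out of which SGP_* flag happens to get passed to shmem:

  #include <linux/fs.h>
  #include <linux/pagemap.h>
  #include <linux/highmem.h>

  static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
  {
  	struct folio *folio;

  	/* Allocates and adds the folio to the page cache if necessary. */
  	folio = filemap_grab_folio(inode->i_mapping, index);
  	if (IS_ERR(folio))
  		return folio;

  	/*
  	 * Use the uptodate flag to track whether the memory has been
  	 * zeroed.  There's no backing storage, so once the folio is zeroed
  	 * it stays uptodate until it's truncated away.
  	 */
  	if (!folio_test_uptodate(folio)) {
  		unsigned long nr_pages = folio_nr_pages(folio);
  		unsigned long i;

  		for (i = 0; i < nr_pages; i++)
  			clear_highpage(folio_page(folio, i));

  		folio_mark_uptodate(folio);
  	}

  	/* Caller is responsible for unlocking and eventually releasing. */
  	return folio;
  }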
The other angle I've been looking at is whether or not having KVM provide its
own implementation will lead to maintenance problems in the future,
specifically if we get to the point where we want to support "fancy" things
like swap and migration.

For migration, I'm quite confident that a dedicated KVM ioctl() versus
wrapping shmem would be at worst a wash, and more than likely simpler if KVM
owns everything.  E.g. migrating pages under TDX and SNP requires invoking
magic instructions, and so we'd be overriding ->migrate_folio() no matter
what.

As for swap, I think we should put a stake in the ground and say that KVM will
never support swap for memory allocated via KVM's ioctl().  Sooo much of the
infrastructure around swap/reclaim is tied to userspace mappings, e.g. knowing
which pages are LRU and/or cold.  I poked around a bit to see how we could
avoid reinventing all of that infrastructure for fd-only memory, and the best
idea I could come up with is basically a rehash of Kirill's very original "KVM
protected memory" RFC[3], i.e. allow "mapping" fd-only memory, but ensure that
memory is never actually present from hardware's perspective.

But on top of the various problems with that approach, the only use cases I
can think of for using fd-only memory to back non-confidential VMs are to
guard against spurious writes/reads to guest memory and/or to avoid the memory
overhead of mapping guest memory into the user page tables.  Avoiding memory
overhead is completely defeated if the page tables are populated with
PROT_NONE mappings, which just leaves the "harden against guest data
corruption" use case.  And for that specific use case, _if_ swap is desirable,
it would be far, far easier to teach the kernel to allow KVM to follow
PROT_NONE (as Kirill's series did), as opposed to trying to teach the kernel
and/or KVM how to swap fd-only memory.

In other words, fd-only memory is purely for slice-of-hardware VMs.  If
someone wants to overcommit VMs, then they use KVM's traditional API for
mapping memory into the guest.

Regarding putting this in KVM, as mentioned (way) above in the quote, the
worst case scenario of making this a KVM ioctl() seems to be that KVM will end
up with an ioctl() that redirects to common kernel code.  On the flip side,
implementing this in KVM gives us a much clearer path to getting all of this
merged.  There will be a few small non-KVM patches, but they will either be
trivial and uncontentious, e.g. exporting APIs, or not strictly required, e.g.
adding a flag to mark an address space as completely unmigratable (for
improved performance).  I.e. the non-KVM patches will be small and very
actionable for their maintainers, and other than that we control our own
destiny.

And on the topic of control, making this a KVM ioctl() and implementation
gives KVM a _lot_ of control.  E.g. we can make the file size immutable to
simplify the implementation, bind the fd to a single VM at creation time, add
KVM-defined flags for controlling hugepage behavior, etc.
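Purely to illustrate the shape of what I have in mind, and emphatically not a
real proposal (the name, ioctl number, and fields below are all invented), the
uAPI could be as simple as a single VM-scoped ioctl:

  #include <linux/types.h>
  #include <linux/kvm.h>		/* for KVMIO */

  /* Illustrative only; not a real or proposed KVM uAPI. */
  struct kvm_create_guest_private_memfd {
  	__u64 size;		/* fixed at creation, i.e. immutable file size */
  	__u64 flags;		/* e.g. opt in/out of hugepage backing */
  	__u64 reserved[6];
  };

  /*
   * Issued on a VM fd, so the returned fd is bound to exactly one VM for
   * its entire lifetime.
   */
  #define KVM_CREATE_GUEST_PRIVATE_MEMFD \
  	_IOW(KVMIO, 0xd0, struct kvm_create_guest_private_memfd)

Userspace would issue that on the VM fd and get back the restricted fd, and
because KVM creates the file, KVM can refuse ftruncate(), reject binding to a
second VM, and so on without needing any new generic infrastructure.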
Last but not least, I think forcing us KVM folks to get our hands at least a
little dirty with MM and FS stuff would be a *good* thing.  KVM has had more
than a few bugs that we missed in no small part because most of the people who
work on KVM have almost zero knowledge of MM and FS, especially of the
boundaries between the two.

Note, I implemented the POC on top of the filemap APIs because that was much
faster and less error prone than re-implementing xarray management myself.  We
may or may not want to actually make the kernel at large aware of these
allocations, i.e. it may be better to follow Hugh's suggestion and use the
folio APIs directly instead of piggybacking filemap+folios.  E.g. I _think_
migration becomes a complete non-issue if the pages are "raw" allocations and
never get marked as LRU-friendly.  What I'm not so sure about is whether there
is anything substantial that we'd lose out on by not reusing the filemap
stuff.

The POC also doesn't play nice with Ackerley's patches to allocate files on a
user-defined mount.  AIUI, that's largely a non-issue, as something like
fbind() would provide a superset of that functionality and more than one
maintainer has expressed (handwavy) support for a generic fbind().

Code is available here if folks want to take a look before any kind of formal
posting:

  https://github.com/sean-jc/linux.git x86/kvm_gmem_solo

[1] https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@xxxxxxxxxx
[2] https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
[3] https://lore.kernel.org/linux-mm/20200522125214.31348-1-kirill.shutemov@xxxxxxxxxxxxxxx