Hi all,

We had a very interesting discussion today led by James Gowans in the Linux MM Alignment Session, thank you James! And thanks to everybody who attended and provided great questions, suggestions, and feedback.

Guestmemfs[*] is proposed to provide an in-memory persistent filesystem primarily aimed at Kexec Hand-Over (KHO) use cases: 1GB allocations, no struct pages, unmapped from the kernel direct map. The memory for this filesystem is set aside by the memblock allocator as defined by the kernel command line (like guestmemfs=900G on a 1TB system).

----->o-----

Feedback from David Hildenbrand was that we may want to leverage HVO to get struct page savings, and the alignment was to define this as part of the filesystem configuration: do you want all struct pages to be gone and memory unmapped from the kernel direct map, or in the kernel direct map with tail pages freed for I/O? You get to choose!

----->o-----

It was noted that the premise for guestmemfs sounded very similar to guest_memfd: a filesystem that indexes non-anonymous guest_memfds is, indeed, not dissimilar to a persistent guest_memfd. The new kernel would need to present the fds to userspace so they can be used once again, so a filesystem abstraction may make sense. We may also want to use uid and gid permissions. It's highly desirable for the kernel to share the same infrastructure and source code, like struct page optimizations, unmapping from the kernel direct map, and naming of the guest_memfd. We'd want to avoid duplicating this, but it's still an open question how it would all be glued together.

David Hildenbrand brought up the idea of a persistent filesystem that even databases could use, which may not be guest_memfd. Persistent filesystems do exist, but lack the 1GB memory allocation requirement; if we were to support databases or other workloads that want to persist memory across kexec, this would instead become a new optimized filesystem for generic use cases that require persistence. Mike Rapoport noted that tying the ability to persist memory across kexec to only guests would preclude this without major changes.

Frank van der Linden noted that the abstraction between guest_memfd and guestmemfs doesn't mesh very well and that we may want to do this at the allocator level instead: basically a factory that gives you exactly what you want -- memory unmapped from the kernel direct map, with HVO instead, etc.

Jason Gunthorpe noted there's a desire to add iommufd connections to guest_memfd and that would have to be duplicated for guestmemfs. KVM has special connections to it, ioctls, etc. So a whole new API surface is likely coming around guest_memfd that guestmemfs will want to re-use. In support of this, it was also noted that guest_memfd is largely used for confidential computing and pKVM today, and confidential computing is a requirement for cloud providers: they need to expose a guest_memfd-style interface for such VMs as well.

Jason suggested that when you create a file on the filesystem, you tell it exactly what you want: unmapped memory, guest_memfd semantics, or just a plain file. James expanded on this by brainstorming an API for such use cases, backed by this new kind of allocator, to provide exactly what you need.
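To make that brainstorming slightly more concrete, here is a minimal sketch of the kind of interface that was being discussed, assuming the allocator-level factory Frank suggested. Every identifier below is hypothetical and exists nowhere today; it is only meant to illustrate the shape of the idea, not a proposed implementation:

    /*
     * Purely illustrative sketch -- none of these names, flags, or
     * functions exist in the kernel or in the guestmemfs proposal.
     */
    #include <linux/types.h>

    /* Characteristics a caller could request for a persistent region. */
    #define PMEM_REGION_1G_PAGES     (1UL << 0) /* back with 1GB allocations */
    #define PMEM_REGION_NO_DIRECTMAP (1UL << 1) /* unmap from the kernel direct map */
    #define PMEM_REGION_HVO          (1UL << 2) /* keep the direct map, free tail
                                                   struct pages via HVO */
    #define PMEM_REGION_GUEST_MEMFD  (1UL << 3) /* expose guest_memfd semantics */

    struct pmem_region; /* opaque handle returned by the allocator */

    /*
     * Allocator-level "factory": hand back persistent memory from the
     * (per-node) pool with exactly the requested characteristics.
     */
    struct pmem_region *pmem_alloc(size_t size, int nid, unsigned long flags);

    /*
     * Filesystem glue: bind a region to a named file so that the next
     * kernel can re-present it to userspace after kexec.
     */
    int pmemfs_create_file(const char *name, struct pmem_region *region);

In this shape, a persistent guestmemfs file, a persistent guest_memfd, and a persistent tmpfs could all be thin consumers of the same factory rather than each owning its own pool, which is roughly the direction Frank's allocator-level suggestion points in.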
----->o-----

James also noted some users are interested in smaller regions of memory that aren't preallocated, like tmpfs, so there is interest in a "persistent tmpfs," including dynamic sizing. This may be tricky because tmpfs uses the page cache. In this case, the preallocation would not be needed.

Mike Rapoport noted the same is the case for memory mapped into the kernel direct map, which is not required for persistence (including if you want to do I/O). The tricky part of this is to determine what should and should not be solved with the same solution. Is it acceptable to have something like guestmemfs, which is very specific to cloud providers running VMs in most of their host memory?

Matthew Wilcox noted there perhaps are ways to support persistence in tmpfs, such as with swap, for this other use case. James noted this could be used for things like the systemd information that people have brought up for containerization. He indicated we should ensure KHO can mark tmpfs pages to be persistent. We'd need to follow up with Alex.

----->o-----

Pasha Tatashin asked about NUMA support with the current guestmemfs proposal. James noted this would be an essential requirement. When specifying the kernel command line with guestmemfs=, we could specify the lengths required from each NUMA node. This would result in per-node mount points (a purely illustrative sketch of what that could look like is at the end of this mail).

----->o-----

Peter Xu asked if IOMMU page tables could be stored in guestmemfs itself to preserve them across kexec. James noted previous solutions for this existed, but were tricky because of filesystem ordering at boot. This led to the conclusion that if we want persistent devices, then we need persistent memory as well; only files from guestmemfs that are known to be persistent can be mapped into a persistent VMA domain. In the case of IOMMU page tables, the IOMMU driver needs to tell KHO that they must be persisted.

----->o-----

My takeaway, based on the feedback provided in the discussion:

 - we need an allocator abstraction for persistent memory that can return memory with various characteristics: 1GB or not, kernel direct map or not, HVO or not, etc.

 - built on top of that, we need the ability to carve out very large ranges of memory (cloud provider use case) with NUMA awareness on the kernel command line

 - we also need the ability to dynamically resize this, or to provide hints at allocation time that memory must be persisted across kexec, to support the non-cloud provider use case

 - we need a filesystem abstraction that maps memory of the type that is requested, including guest_memfd, and then deals with all the fun of multitenancy since it would be drawing from a finite per-NUMA-node pool of persistent memory

 - absolutely critical to this discussion is defining the core infrastructure that is required for a generally acceptable solution and then what builds off of that for the more specialized cases (like the cloud provider use case or the persistent tmpfs use case)

We're looking to continue that discussion here and then come together again in a few weeks. Thanks!

[*] https://lore.kernel.org/kvm/20240805093245.889357-1-jgowans@xxxxxxxxxx/
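As referenced in the NUMA discussion above, here is a purely illustrative sketch of how a per-node carve-out could be expressed. Only the single-size guestmemfs= form appears in the current proposal; the per-node syntax and the mount point names below are hypothetical:

    guestmemfs=700G@0,200G@1

(meaning 700G reserved from node 0 and 200G from node 1), which would then result in per-node mount points along the lines of:

    /mnt/guestmemfs-node0
    /mnt/guestmemfs-node1

so that files, and hence guest memory, can be placed on a specific NUMA node.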