On Thu, 2024-10-17 at 10:23 +0530, Vishal Annapurve wrote:
> On Mon, Aug 5, 2024 at 3:03 PM James Gowans <jgowans@xxxxxxxxxx> wrote:
> >
> > In this patch series a new in-memory filesystem designed specifically
> > for live update is implemented. Live update is a mechanism to support
> > updating a hypervisor in a way that has limited impact on running
> > virtual machines. This is done by pausing/serialising running VMs,
> > kexec-ing into a new kernel, starting new VMM processes and then
> > deserialising/resuming the VMs so that they continue running from where
> > they were. To support this, guest memory needs to be preserved.
> >
> > Guestmemfs implements preservation across kexec by carving out a large
> > contiguous block of host system RAM early in boot which is then used as
> > the data for the guestmemfs files. As well as preserving that large
> > block of data memory across kexec, the filesystem metadata is preserved
> > via the Kexec Hand Over (KHO) framework (still under review):
> > https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx/
> >
> > Filesystem metadata is structured to make preservation across kexec
> > easy: inodes are one large contiguous array, and each inode has a
> > "mappings" block which defines which block from the filesystem data
> > memory corresponds to which offset in the file.
> >
> > There are additional constraints/requirements which guestmemfs aims to
> > meet:
> >
> > 1. Secret hiding: all filesystem data is removed from the kernel direct
> > map and so is immune to speculative access. read()/write() are not
> > supported; the only way to get at the data is via mmap.
> >
> > 2. Struct page overhead elimination: the memory is not managed by the
> > buddy allocator and hence has no struct pages.
> >
> > 3. PMD and PUD level allocations for TLB performance: guestmemfs
> > allocates PMD-sized pages to back files, which improves TLB performance
> > (caveat below!). PUD-sized allocations are a next step.
> >
> > 4. Device assignment: being able to use guestmemfs memory for
> > VFIO/iommufd mappings, and allow those mappings to survive and continue
> > to be used across kexec.
> >
> >
> > Next steps
> > ==========
> >
> > The idea is that this patch series implements a minimal filesystem to
> > provide the foundations for in-memory files that persist across kexec.
> > Once this foundation is in place it will be extended:
> >
> > 1. Improve the filesystem to be more comprehensive - currently it's just
> > functional enough to demonstrate the main objective of reserved memory
> > and persistence via KHO.
> >
> > 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> > that with guestmemfs. The idea is that if VMs have DMA devices assigned
> > to them, DMA should continue running across kexec. A future patch series
> > will add support for this in iommufd and connect iommufd to guestmemfs
> > so that guestmemfs files can remain mapped into the IOMMU during kexec.
> >
> > 3. Support a guest_memfd interface to files so that they can be used for
> > confidential computing without needing to mmap into userspace.
>
> I am guessing this goal was set before we discussed the need to support
> mmap on guest_memfd for confidential computing use cases with
> hugepages [1]. This series [1] as of today tries to leverage hugetlb
> allocator functionality to allocate huge pages, which seems to be along
> the lines of what you are aiming for. There are also discussions about
> supporting NUMA mempolicy [2] for guest_memfd. In order to use
> guest_memfd to back non-confidential VMs with hugepages, core-mm will
> need to support PMD/PUD level mappings in future.
>
> David H's suggestion from the other thread to extend guest_memfd to
> support guest memory persistence over kexec, instead of introducing
> guestmemfs as a parallel subsystem, seems appealing to me.

I think there is a lot of overlap with the huge page goals for
guest_memfd, especially the 1 GiB allocations; those also need a custom
allocator that can hand out chunks from something other than the core
MM buddy allocator.

My rough plan is to rebase on top of the 1 GiB guest_memfd support code
and add guestmemfs as another allocator, very similar to the hugetlbfs
1 GiB allocations. I still need to engage on the hugetlb(fs?) allocator
patch series, but I think in concept it's all going in the right
direction for this persistence use case too.

JG

> [1] https://lore.kernel.org/kvm/cover.1726009989.git.ackerleytng@xxxxxxxxxx/T/
> [2] https://lore.kernel.org/kvm/47476c27-897c-4487-bcd2-7ef6ec089dd1@xxxxxxx/T/
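
For readers following the design described in the quoted cover letter
above (a single contiguous inode array, where each inode has a
"mappings" block translating file offsets to blocks in the reserved
data region), a minimal sketch of such a layout could look like the
following. All structure and field names here are illustrative
assumptions, not the actual guestmemfs definitions.

/*
 * Rough sketch (NOT the real guestmemfs code) of a persistence-friendly
 * metadata layout: one contiguous inode table plus a per-inode
 * "mappings" table translating file offsets to blocks in the reserved
 * data region. Names and sizes are illustrative assumptions only.
 */
#include <stdint.h>

#define GMFS_NAME_MAX     255
#define GMFS_BLOCK_SIZE   (2UL * 1024 * 1024)  /* PMD-sized data blocks */

/* One entry per data block backing a file. */
struct gmfs_mapping {
	uint64_t file_offset;     /* byte offset within the file */
	uint64_t block_index;     /* block number in the reserved data region */
};

struct gmfs_inode {
	char     name[GMFS_NAME_MAX + 1];
	uint64_t size;            /* file size in bytes */
	uint64_t nr_mappings;     /* valid entries in the mappings block */
	uint64_t mappings_offset; /* offset of this inode's mappings block
	                           * within the metadata region */
};

/*
 * Because the inode table and mappings blocks live in one contiguous
 * metadata region, the whole thing can be described to the next kernel
 * as a single physical range (e.g. via KHO) and reused after kexec.
 */
struct gmfs_metadata {
	uint64_t nr_inodes;
	struct gmfs_inode inodes[]; /* flat, fixed-stride inode array */
};

The appeal of keeping everything flat and contiguous is that preserving
the metadata only requires handing over one physical range, alongside
the reserved data block itself.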
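
Since read()/write() are deliberately unsupported, userspace access
would go through mmap() only. A hypothetical usage sketch, with the
mount point and file name invented purely for illustration:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical mount point and file name. */
	int fd = open("/mnt/guestmemfs/vm0-memory", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Map 2 MiB; the backing is PMD-sized blocks per the cover letter. */
	void *mem = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}

	/* ... hand the mapping to the VMM / KVM memslot setup ... */

	munmap(mem, 2UL << 20);
	close(fd);
	return 0;
}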