Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem

Hi James,

On Mon, Aug 5, 2024 at 2:33 AM James Gowans <jgowans@xxxxxxxxxx> wrote:
>
> In this patch series a new in-memory filesystem designed specifically
> for live update is implemented. Live update is a mechanism to support
> updating a hypervisor in a way that has limited impact to running
> virtual machines. This is done by pausing/serialising running VMs,
> kexec-ing into a new kernel, starting new VMM processes and then
> deserialising/resuming the VMs so that they continue running from where
> they were. To support this, guest memory needs to be preserved.

How do you envision VM state (or other userspace state) being
preserved? I guess it could just be regular files on this filesystem
but I wonder if that would become inefficient if the files are
(eventually) backed with PUD-sized allocations.

>
> Guestmemfs implements preservation across kexec by carving out a large
> contiguous block of host system RAM early in boot which is then used as
> the data for the guestmemfs files. As well as preserving that large
> block of data memory across kexec, the filesystem metadata is preserved
> via the Kexec Hand Over (KHO) framework (still under review):
> https://lore.kernel.org/all/20240117144704.602-1-graf@xxxxxxxxxx/
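
Just to check my understanding of the carve-out: I imagine the reservation
itself is something along the lines of the sketch below, i.e. a chunk taken
from memblock before the buddy allocator ever sees it. The names, the
PUD_SIZE alignment and the cmdline-sized region are my assumptions, not
taken from the series:

#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/pgtable.h>
#include <linux/printk.h>

static phys_addr_t guestmemfs_base;
static phys_addr_t guestmemfs_size;     /* e.g. parsed from a cmdline param */

/* Carve the data region out of memblock very early in boot. */
static void __init guestmemfs_reserve(void)
{
        guestmemfs_base = memblock_phys_alloc(guestmemfs_size, PUD_SIZE);
        if (!guestmemfs_base)
                pr_warn("guestmemfs: failed to reserve %pa bytes\n",
                        &guestmemfs_size);
}
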
>
> Filesystem metadata is structured to make preservation across kexec
> easy: inodes are one large contiguous array, and each inode has a
> "mappings" block which defines which block from the filesystem data
> memory corresponds to which offset in the file.
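
If I'm reading this right, the preserved metadata boils down to something
like the sketch below; the struct, field names and sizes here are my guesses
for illustration, not the ones used in the series:

#define GUESTMEMFS_BLOCK_SIZE   (2UL << 20)     /* PMD-sized data blocks */
#define MAPPINGS_PER_INODE      512             /* assumed per-file limit */

struct guestmemfs_inode {
        unsigned long size;
        /*
         * mappings[i] is the index of the data block (in the region
         * reserved at boot) that backs file offset i * GUESTMEMFS_BLOCK_SIZE.
         */
        unsigned long mappings[MAPPINGS_PER_INODE];
};

/* All inodes live in one contiguous array, preserved as a whole via KHO. */
static struct guestmemfs_inode *guestmemfs_inodes;

Is that roughly the shape of it?
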
>
> There are additional constraints/requirements which guestmemfs aims to
> meet:
>
> 1. Secret hiding: all filesystem data is removed from the kernel direct
> map so it is immune to speculative access. read()/write() are not supported;
> the only way to get at the data is via mmap.
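
For anyone trying to picture the userspace flow: with read()/write()
unsupported I assume access looks roughly like the snippet below (the mount
point and file name are invented for illustration):

#include <err.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 2UL << 20;         /* one 2 MiB PMD-sized block */
        int fd = open("/mnt/guestmemfs/vm0-mem", O_RDWR);

        if (fd < 0)
                err(1, "open");

        /* mmap() is the only way to reach the file data. */
        void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED)
                err(1, "mmap");

        /* ... hand `mem' to the VMM / KVM memslot setup ... */

        munmap(mem, len);
        close(fd);
        return 0;
}
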
>
> 2. Struct page overhead elimination: the memory is not managed by the
> buddy allocator and hence has no struct pages.

I'm curious if there are any downsides to eliminating struct pages, e.g.
certain operations/features in the kernel relevant for running VMs that
do not work?

>
> 3. PMD and PUD level allocations for TLB performance: guestmemfs
> allocates PMD-sized pages to back files which improves TLB perf (caveat
> below!). PUD size allocations are a next step.
>
> 4. Device assignment: being able to use guestmemfs memory for
> VFIO/iommufd mappings, and allow those mappings to survive and continue
> to be used across kexec.
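
On the device assignment point, the userspace side I'm imagining is roughly
the sketch below: mmap() the guestmemfs file as above and feed that VA to
VFIO type1 (or the iommufd equivalent). The helper name is mine; the
interesting part is then making the IOMMU mapping, and any ongoing DMA,
survive the kexec:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* va is a mapping of the guestmemfs file obtained via mmap(). */
static int dma_map_guest_ram(int container_fd, void *va,
                             unsigned long long iova, unsigned long long len)
{
        struct vfio_iommu_type1_dma_map map = {
                .argsz = sizeof(map),
                .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                .vaddr = (uintptr_t)va,
                .iova  = iova,
                .size  = len,
        };

        /* Today this mapping dies with the kernel; the goal is for it to
         * persist across the kexec together with the backing memory. */
        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}
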
>
>
> Next steps
> =========
>
> The idea is that this patch series implements a minimal filesystem to
> provide the foundation for in-memory files that persist across kexec.
> Once this foundation is in place it will be extended:
>
> 1. Improve the filesystem to be more comprehensive - currently it's just
> functional enough to demonstrate the main objective of reserved memory
> and persistence via KHO.
>
> 2. Build support for iommufd IOAS and HWPT persistence, and integrate
> that with guestmemfs. The idea is that if VMs have DMA devices assigned
> to them, DMA should continue running across kexec. A future patch series
> will add support for this in iommufd and connect iommufd to guestmemfs
> so that guestmemfs files can remain mapped into the IOMMU during kexec.
>
> 3. Support a guest_memfd interface to files so that they can be used for
> confidential computing without needing to mmap into userspace.
>
> 4. Gigantic PUD level mappings for even better TLB perf.
>
> Caveats
> =======
>
> There are a few issues with the current implementation which should be
> solved either in this patch series or soon in follow-on work:
>
> 1. Although PMD-size allocations are done, PTE-level page tables are
> still created. This is because guestmemfs uses remap_pfn_range() to set
> up userspace pgtables. Currently remap_pfn_range() only creates
> PTE-level mappings. I suggest enhancing remap_pfn_range() to support
> creating higher level mappings where possible, by adding pmd_special
> and pud_special flags.

This might actually be beneficial.

Creating PTEs for userspace mappings would make it possible for UserfaultFD
to intercept at PAGE_SIZE granularity. A big pain point for Google with
using HugeTLB is the inability to use UserfaultFD to intercept at
PAGE_SIZE granularity during the post-copy phase of VM Live Migration.

As long as the memory is physically contiguous it should be possible
for KVM to still map it into the guest with PMD or PUD mappings.
KVM/arm64 already even has support for that for VM_PFNMAP VMAs.
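
For reference, the mmap path I'm picturing is a minimal handler along the
lines of the sketch below, assuming for simplicity that a file is backed by
one physically contiguous range; guestmemfs_file_base_pfn() is a placeholder
of mine, not a function from the series:

#include <linux/fs.h>
#include <linux/mm.h>

/* Placeholder: PFN of the first data block backing this file. */
static unsigned long guestmemfs_file_base_pfn(struct file *file);

static int guestmemfs_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;
        unsigned long pfn = guestmemfs_file_base_pfn(file) + vma->vm_pgoff;

        /*
         * remap_pfn_range() marks the VMA VM_PFNMAP and, as it stands,
         * installs PTE-level (4K) entries regardless of how large the
         * backing allocation is; PMD/PUD "special" mappings would need
         * the remap_pfn_range() enhancement suggested above.
         */
        return remap_pfn_range(vma, vma->vm_start, pfn, size,
                               vma->vm_page_prot);
}

So keeping the option of PTE-granular userspace mappings while letting KVM
map at PMD/PUD in stage-2 seems like a reasonable split to me.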




