Re: [PATCH 00/10] Introduce guestmemfs: persistent in-memory filesystem

David Hildenbrand <david@xxxxxxxxxx> · Tue, 6 Aug 2024 15:43:24 +0200

1. Secret hiding: with guestmemfs all of the memory is out of the kernel
direct map as an additional defence mechanism. This means no
read()/write() syscalls to guestmemfs files, and no IO to it. The only
way to access it is to mmap the file.

There are people interested in similar things for guest_memfd.
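
To make the mmap-only access model from point 1 concrete, here is a minimal userspace sketch. It only shows the generic pattern of opening a file and mapping it shared. The helper name map_file and the path in the usage comment are hypothetical, and nothing here is taken from the patch series itself. On an ordinary file read()/write() would of course still work; the point is that, per the description above, the mapping would be the only path on guestmemfs.

```python
import mmap
import os


def map_file(path, length):
    """Map `length` bytes of `path` shared and read/write; return the mmap object."""
    fd = os.open(path, os.O_RDWR)
    try:
        # MAP_SHARED so that stores land in the file's backing memory
        # rather than in a private copy.
        return mmap.mmap(fd, length, flags=mmap.MAP_SHARED,
                         prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # Python's mmap duplicates the descriptor, so the mapping stays
        # valid after the original fd is closed.
        os.close(fd)


# Hypothetical usage; the mount point and file name are assumptions:
#   mem = map_file("/mnt/guestmemfs/vm0-ram", 1 << 30)
#   mem[0:4] = b"\x00\x00\x00\x00"
```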

2. No struct page overhead: the intended use case is for systems whose
sole job is to be a hypervisor, typically for large (multi-GiB) VMs, so
the majority of system RAM would be donated to this fs. We definitely
don't want 4 KiB struct pages here as it would be a significant
overhead. That's why guestmemfs carves the memory out in early boot and
sets memblock flags to avoid struct page allocation. I don't know if
hugetlbfs does anything fancy to avoid allocating PTE-level struct pages
for its memory?

Sure, it's called HVO (HugeTLB Vmemmap Optimization) and can optimize out
a significant portion of the vmemmap.
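
For a rough feel of the numbers, here is a back-of-the-envelope sketch. It assumes a 4 KiB base page, a 64-byte struct page (common on x86-64, but config-dependent), and that HVO leaves one vmemmap page per huge page in place (older kernels kept more). The function names are made up for illustration.

```python
PAGE_SIZE = 4096        # base page size in bytes (assumption: 4 KiB)
STRUCT_PAGE_SIZE = 64   # assumed sizeof(struct page); depends on kernel config


def vmemmap_bytes(mem_bytes):
    """vmemmap needed to describe mem_bytes with one struct page per base page."""
    return mem_bytes // PAGE_SIZE * STRUCT_PAGE_SIZE


def hvo_vmemmap_bytes(mem_bytes, hugepage_bytes):
    """vmemmap remaining with HVO, assuming one vmemmap page is kept per huge page."""
    return mem_bytes // hugepage_bytes * PAGE_SIZE


GIB = 1 << 30
print(vmemmap_bytes(GIB))                # plain 4 KiB struct pages, per GiB
print(hvo_vmemmap_bytes(GIB, 2 << 20))   # HVO with 2 MiB huge pages, per GiB
print(hvo_vmemmap_bytes(GIB, GIB))       # HVO with 1 GiB huge pages, per GiB
```

Under these assumptions that is 16 MiB of vmemmap per GiB without HVO (about 1.6%), 2 MiB per GiB with 2 MiB huge pages, and a single 4 KiB page per GiB with 1 GiB huge pages.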

3. guest_memfd interface: For confidential computing use-cases we need
to provide a guest_memfd style interface so that these FDs can be used
as a guest_memfd file in KVM memslots. Would there be interest in
extending hugetlbfs to also support a guest_memfd style interface?

"Extending hugetlbfs" sounds wrong; hugetlbfs is a blast from the past 
and not something people are particularly keen to extend for such use 
cases. :)

Instead, as Jason said, we're looking into letting guest_memfd own and 
manage large chunks of contiguous memory.

4. Metadata designed for persistence: guestmemfs will need to keep
simple internal metadata data structures (limited allocations, limited
fragmentation) so that pages can easily and efficiently be marked as
persistent via KHO. Something like slab allocations would probably be a
no-go as then we'd need to persist and reconstruct the slab allocator. I
don't know how hugetlbfs structures its fs metadata but I'm guessing it
uses the slab and does lots of small allocations so trying to retrofit
persistence via KHO to it may be challenging.

5. Integration with persistent IOMMU mappings: to keep DMA running
across kexec, iommufd needs to know that the backing memory for an IOAS
is persistent too. The idea is to do some DMA pinning of persistent
files, which would require iommufd/guestmemfs integration - would we
want to add this to hugetlbfs?

6. Virtualisation-specific APIs: starting to get a bit esoteric here,
but use-cases like being able to carve out specific chunks of memory
from a running VM and turn it into memory for another side car VM, or
doing post-copy LM via DMA by mapping memory into the IOMMU but taking
page faults on the CPU. This may require virtualisation-specific ioctls
on the files which wouldn't be generally applicable to hugetlbfs.

7. NUMA control: a requirement is to always have correct NUMA affinity.
While currently not implemented the idea is to extend the guestmemfs
allocation to support specifying allocation sizes from each NUMA node at
early boot, and then having multiple mount points, one per NUMA node (or
something like that...). Unclear if this is something hugetlbfs would
want.

There are probably more potential issues, but those are the ones that
come to mind... That being said, if hugetlbfs maintainers are interested
in going in this direction then we can definitely look at enhancing
hugetlbfs.

I think there are two types of problems: "Would hugetlbfs want this
functionality?" - that's the majority. And a few are "This would be hard
with hugetlbfs!" - persistence probably falls into this category.

I'm rather asking myself whether you should instead teach/extend the
guest_memfd concept with some of what you propose here.

At least "guest_memfd" sounds a lot like the "anonymous fd" based 
variant of guestmemfs ;)

Just like we have both hugetlbfs and memfd with hugetlb pages.

--
Cheers,

David / dhildenb