On Wed, May 10, 2023, Vishal Annapurve wrote:
> On Wed, May 10, 2023 at 2:39 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > But I would still like to get this discussed here before we move on.
> > >
> > > I am wondering if it would make sense to implement
> > > restricted_mem/guest_mem file to expose both private and shared memory
> > > regions, in line with Kirill's original proposal now that the file
> > > implementation is controlled by KVM.
> > >
> > > Thinking from userspace perspective:
> > > 1) Userspace creates guest mem files and is able to mmap them but all
> > > accesses to these files result in faults as no memory is allowed to
> > > be mapped into userspace VMM pagetables.
> >
> > Never mapping anything into the userspace page table is infeasible. Technically
> > it's doable, but it'd effectively require all of the work of an fd-based approach
> > (and probably significantly more), _and_ it'd require touching core mm code.
> >
> > VMAs don't provide hva=>pfn information, they're the kernel's way of implementing
> > the abstraction provided to userspace by mmap(), mprotect() etc. Among many other
> > things, a VMA describes properties of what is mapped, e.g. hugetlbfs versus
> > anonymous, where memory is mapped (virtual address), how memory is mapped, e.g.
> > RWX protections, etc. But a VMA doesn't track the physical address, that info
> > is all managed through the userspace page tables.
> >
> > To make it possible to allow userspace to mmap() but not access memory (without
> > redoing how the kernel fundamentally manages virtual=>physical mappings), the
> > simplest approach is to install PTEs into userspace page tables, but never mark
> > them Present in hardware, i.e. prevent actually accessing the backing memory.
> > This is exactly what Kirill's series in link [3] below implemented.
> >
>
> Maybe it's simpler to do when mmaped regions are backed with files.
>
> I see that shmem has fault handlers for accesses to VMA regions
> associated with the files. In theory a file implementation can always
> choose to not allocate physical pages for such faults (similar to
> F_SEAL_FAULT_AUTOALLOCATE that was discussed earlier).

Ah, you're effectively suggesting a hybrid model where the file is the single
source of truth for what's private versus shared, and KVM gets pfns through
direct communication with the backing store via the file descriptor, but userspace
can still control things via mmap() and friends.

If you're not suggesting a backdoor, i.e. KVM still gets private pfns via hvas,
then we're back at Kirill's series, because otherwise there's no easy way for KVM
to retrieve the pfn.

A form of this was also discussed, though I don't know how much of the discussion
happened on-list.

KVM actually does something like this for s390's Ultravisor (UV), which is quite
a bit like TDX (UV is a trusted intermediary) except that it handles faults much,
much more gracefully. Specifically, when the untrusted host attempts to access a
secure page, a fault occurs and the kernel responds by telling UV to export the
page. The fault is gracefully handled even for kernel accesses
(see do_secure_storage_access()). The kernel does BUG() if the export fails when
handling a fault from kernel context, but my understanding is that export can fail
if and only if there's a fatal error elsewhere, i.e. the UV essentially _ensures_
success, and goes straight to BUG()/panic() if something goes wrong.

On the guest side, accesses to exported (swapped) secure pages generate intercepts
and KVM faults in the page. To do so, KVM freezes the page/folio refcount, tells
the UV to import the page, and then unfreezes the page/folio. But very crucially,
when _anything_ in the untrusted host attempts to access the secure page, the
above fault handling for untrusted host accesses kicks in. In other words, the
guest can cause thrash, but can't bring down the host.

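As a rough illustration of the s390 flow described above, here is a toy model in
C. Every name in it (toy_page, host_access(), guest_access(), thrash()) is
hypothetical and is not a kernel or UV interface, and the model assumes export
and import always succeed, which is only a stand-in for "export can fail if and
only if there's a fatal error elsewhere".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; these are NOT kernel or Ultravisor interfaces. */
enum page_state { PAGE_SECURE, PAGE_EXPORTED };

struct toy_page {
    enum page_state state;
    unsigned int exports;   /* exports triggered by host accesses */
    unsigned int imports;   /* imports triggered by guest accesses */
};

/*
 * Host access: touching a secure page faults, and the fault handler asks the
 * (modelled) UV to export the page. Assumption: export always succeeds.
 */
static bool host_access(struct toy_page *p)
{
    if (p->state == PAGE_SECURE) {
        p->state = PAGE_EXPORTED;
        p->exports++;
    }
    return true;
}

/*
 * Guest access: touching an exported page generates an intercept, and the
 * page is imported again. Assumption: import always succeeds.
 */
static bool guest_access(struct toy_page *p)
{
    if (p->state == PAGE_EXPORTED) {
        p->state = PAGE_SECURE;
        p->imports++;
    }
    return true;
}

/* Alternate host and guest accesses; returns how many accesses failed. */
static unsigned int thrash(struct toy_page *p, unsigned int rounds)
{
    unsigned int failures = 0;

    for (unsigned int i = 0; i < rounds; i++) {
        if (!host_access(p))
            failures++;
        if (!guest_access(p))
            failures++;
    }
    return failures;
}
```

In this sketch, ping-ponging a page between host and guest only bumps the
export/import counters; no access ever fails, which is the "thrash, but can't
bring down the host" property.
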
TDX on the other hand silently poisons memory, i.e. doesn't even generate a
synchronous fault. Thus the kernel needs to be 100% perfect on preventing _any_
accesses to private memory from the host, and doing that is non-trivial and
invasive.

SNP does synchronously fault, but automatically converting in the #PF handler
got NAK'd[*] for good reasons, e.g. SNP doesn't guarantee conversion success as the
guest can trigger concurrent RMP modifications. So the end result ends up being
the same as TDX, host accesses need to be completely prevented.

Again, this is all doable, but costly. And IMO, provides very little value.

Allowing things like mbind() is nice-to-have at best, as implementing fbind()
isn't difficult and arguably valuable to have irrespective of this discussion, e.g.
to allow userspace to say "use this policy regardless of what process maps the
file".

Using a common memory pool (same physical page is used for both shared and private)
is a similar story. There are plenty of existing controls to limit userspace/guest
memory usage and to deal with OOM scenarios, so barring egregious host accounting
and/or limiting bugs, which would affect _all_ VM types, the worst case scenario
is that a VM is terminated because host userspace is buggy. On the flip side, using
a common pool brings complexity into the kernel, as backing stores would need to
be taught to deny access to a subset of pages in their mappings, and in multiple
paths, e.g. faults, read()/write() and similar, page migration, swap, etc.

[*] https://lore.kernel.org/linux-mm/8a244d34-2b10-4cf8-894a-1bf12b59cf92@xxxxxxxxxxxxxxxx

> > Issues that led to us abandoning the "map with special !Present PTEs" approach:
> >
> > - Using page tables, i.e. hardware defined structures, to track gfn=>pfn mappings
> >   is inefficient and inflexible compared to software defined structures, especially
> >   for the expected use cases for CoCo guests.
> >
> > - The kernel wouldn't _easily_ be able to enforce a 1:1 page:guest association,
> >   let alone a 1:1 pfn:gfn mapping.
>
> Maybe KVM can ensure that each page of the guest_mem file is
> associated with a single memslot.

This is a hard NAK. Guest physical address space is guaranteed to have holes
and/or be discontiguous, e.g. the PCI hole at the top of lower memory. Allowing
only a single binding would prevent userspace from backing all (or large chunks)
of guest memory with a single file.

> HVAs when they are registered can be associated with offsets into guest_mem files.

Enforcing 1:1 associations is doable if KVM inserts a shim/interposer, e.g.
essentially implements the exclusivity bits of restrictedmem. But that's adding
even more complexity.

> > - Does not work for memory that isn't backed by 'struct page', e.g. if devices
> >   gain support for exposing encrypted memory regions to guests.
> >
> > - Poking into the VMAs to convert memory would likely be less performant due
> >   to using infrastructure that is much "heavier", e.g. would require taking
> >   mmap_lock for write.
>
> Converting memory doesn't necessarily need to poke holes into VMA, but
> rather just unmap pagetables just like what would happen when mmapped
> files are punched to free the backing file offsets.

Sorry, bad choice of word on my part. I didn't intend to imply poking holes, in
this case I used "poking" to mean "modifying". munmap(), mprotect(), etc. all
require modifying VMAs, which means taking mmap_lock for write.

> > In short, shoehorning this into mmap() requires fighting how the kernel works at
> > pretty much every step, and in the end, adding e.g. fbind() is a lot easier.
> >
> > > 2) Userspace registers mmaped HVA ranges with KVM with additional
> > > KVM_MEM_PRIVATE flag
> > > 3) Userspace converts memory attributes and this memory conversion
> > > allows userspace to access shared ranges of the file because those are
> > > allowed to be faulted in from guest_mem.
> > > Shared to private conversion
> > > unmaps the file ranges from userspace VMM pagetables.
> > > 4) Granularity of userspace pagetable mappings for shared ranges will
> > > have to be dictated by KVM guest_mem file implementation.
> > >
> > > Caveat here is that once private pages are mapped into userspace view.
> > >
> > > Benefits here:
> > > 1) Userspace view remains consistent while still being able to use HVA ranges
> > > 2) It would be possible to use HVA based APIs from userspace to do
> > > things like binding.
> > > 3) Double allocation wouldn't be a concern since hva ranges and gpa
> > > ranges possibly map to the same HPA ranges.
> >
> > #3 isn't entirely correct. If a different process (call it "B") maps shared memory,
> > and then the guest converts that memory from shared to private, the backing pages
> > for the previously shared mapping will still be mapped by process B unless userspace
> > also ensures process B also unmaps on conversion.
> >
>
> This should be ideally handled by something like: unmap_mapping_range()

That'd work for the hybrid model (fd backdoor with pseudo mmap() support), but
not for a generic VMA-based implementation. If the file isn't the single source
of truth, then forcing all mappings to go away simply can't work.

> > #3 is also a limiter. E.g. if a guest is primarily backed by 1GiB pages, keeping
> > the 1GiB mapping is desirable if the guest converts a few KiB of memory to shared,
> > and possibly even if the guest converts a few MiB of memory.
>
> This caveat maybe can be lived with as shared ranges most likely will
> not be backed by 1G pages anyways, possibly causing IO performance to
> get hit. This possibly needs more discussion about conversion
> granularity used by guests.

Yes, it's not the end of the world. My point is that separating shared and private
memory provides more flexibility. Maybe that flexibility never ends up being
super important, but at the same time we shouldn't willingly paint ourselves into
a corner.
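
Returning to the memslot point earlier in this mail (guest physical address space
has holes, so a single file needs more than one binding), the sketch below shows
one way to picture it. The layout (3GiB of low memory, a 1GiB PCI hole, 2GiB of
high memory) and all names (toy_slot, gpa_to_file_offset()) are made up for
illustration and are not KVM's data structures or uAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical binding of a guest physical range to an offset in one file. */
struct toy_slot {
    uint64_t gpa_start;    /* first guest physical address in the slot */
    uint64_t size;         /* slot size in bytes */
    uint64_t file_offset;  /* where the slot starts in the backing file */
};

/*
 * Assumed layout: low memory below a PCI hole at [3GiB, 4GiB), high memory
 * above it. Both slots bind to the same 5GiB file, back to back.
 */
static const struct toy_slot slots[] = {
    { 0 * GiB, 3 * GiB, 0 * GiB },  /* low memory  -> file [0, 3GiB)    */
    { 4 * GiB, 2 * GiB, 3 * GiB },  /* high memory -> file [3GiB, 5GiB) */
};

/* Translate a gpa to an offset in the single backing file, or -1 for a hole. */
static int64_t gpa_to_file_offset(uint64_t gpa)
{
    for (size_t i = 0; i < sizeof(slots) / sizeof(slots[0]); i++) {
        const struct toy_slot *s = &slots[i];

        if (gpa >= s->gpa_start && gpa - s->gpa_start < s->size)
            return (int64_t)(s->file_offset + (gpa - s->gpa_start));
    }
    return -1;
}
```

With only one binding allowed per file, the second slot above would need its own
file, which is the restriction being objected to.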