I wanted to revive this thread based on the mm alignment discussion for
guest_memfd.

Gunyah's guest_memfd allocates memory via filemap_alloc_folio, identical
to KVM's guest_memfd. There's a possibility of a stage-2 fault when
memory is donated to a guest VM and Linux incidentally tries to access
the donated memory with an unaligned access. This access will cause the
kernel to panic, as it expects to be able to access all memory which has
been mapped in stage 1. We don't want to disallow unaligned access
simply because Gunyah drivers are enabled.

There are two options I see to prevent the stage-2 fault from crashing
the kernel: we can fix up the stage-2 fault or ensure that Linux has an
S1 table consistent with S2. To do the latter, the obvious solution
seemed to be using the set_direct_map functions, but you and Christoph
have valid concerns about exporting this to modules since it's a
low-level API. One way to avoid exporting the symbols is to make Gunyah
a built-in, but I'd like to find a better solution.

One way I can think of is to create a "guest_memfd library" that both
KVM and Gunyah can use. It abstracts the common bits between the two
into a built-in module and can be the one to call the set_direct_map
functions. I also think the abstraction will help keep KVM guest_memfd
cleaner once we start supporting huge folios (and splitting them). Do
KVM and mm folks also see value in using a library-fied guest_memfd?

Thanks,
Elliot

On Thu, Feb 29, 2024 at 05:35:45PM -0800, Elliot Berman wrote:
> On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> > On 26.02.24 18:27, Elliot Berman wrote:
> > > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> > > > On 26.02.24 12:06, Christoph Hellwig wrote:
> > > > > The point is that we can't just allow modules to unmap data from
> > > > > the kernel mapping, no matter how noble your intentions are.
> > > >
> > > > I absolutely agree.
> > > >
> > >
> > > Hi David and Christoph,
> > >
> > > Are your preferences that we should make Gunyah builtin only or should add
> > > fixing up S2 PTW errors (or something else)?
> >
> > Having that built into the kernel certainly does sound better than exposing
> > that functionality to arbitrary OOT modules. But still, this feels like it
> > is using a "too-low-level" interface.
> >
>
> What are your thoughts about fixing up the stage-2 fault instead? I
> think this gives mmu-based isolation a slight speed boost because we
> avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah)
> knows that the page isn't mapped. Whether we get S2 or S1 fault, the
> kernel is likely going to crash, except in the rare case where we want
> to fix the exception. In that case, we can modify the S2 fault handler
> to call fixup_exception() when appropriate.
>
> > >
> > > Also, do you extend that preference to modifying S2 mappings? This would
> > > require any hypervisor driver that supports confidential compute
> > > usecases to only ever be builtin.
> > >
> > > Is your concern about unmapping data from kernel mapping, then module
> > > being unloaded, and then having no way to recover the mapping? Would a
> > > permanent module be better? The primary reason we were wanting to have
> > > it as module was to avoid having driver in memory if you're not a Gunyah
> > > guest.
> >
> > What I didn't grasp from this patch description: is the area where a driver
> > would unmap/remap that memory somehow known ahead of time and limited?
> >
> > How would the driver obtain that memory it would try to unmap/remap the
> > direct map of? Simply allocate some pages and then unmap the direct map?
>
> That's correct.
>
> >
> > For example, we do have mm/secretmem.c, where we unmap the directmap on
> > allocation and remap when freeing a page. A nice abstraction on alloc/free,
> > so one cannot really do a lot of harm.
> >
> > Further, we enlightened the remainder of the system about secretmem, such
> > that we can detect that the directmap is no longer there. As one example,
> > see the secretmem_active() check in kernel/power/hibernate.c.
> >
>
> I'll take a look at this. guest_memfd might be able to use PM notifiers here
> instead, but I'll dig in the archives to see why secretmem isn't using that.
>
> > A similar abstraction would make sense (I remember a discussion about having
> > secretmem functionality in guest_memfd, would that help?), but the question
> > is "which" memory you want to unmap the direct map of, and how the driver
> > became "owner" of that memory such that it would really be allowed to mess
> > with the directmap.
>