On Tue, Aug 20, 2024 at 09:56:10AM -0700, Elliot Berman wrote:
> On Mon, Aug 19, 2024 at 01:09:52PM +0300, Mike Rapoport wrote:
> > On Mon, Aug 05, 2024 at 11:34:49AM -0700, Elliot Berman wrote:
> > > This patch was reworked from Patrick's patch:
> > > https://lore.kernel.org/all/20240709132041.3625501-6-roypat@xxxxxxxxxxxx/
> > >
> > > While guest_memfd is not available to be mapped by userspace, it is
> > > still accessible through the kernel's direct map. This means that in
> > > scenarios where guest-private memory is not hardware protected, it can
> > > be speculatively read and its contents potentially leaked through
> > > hardware side-channels. Removing guest-private memory from the direct
> > > map thus mitigates a large class of speculative execution issues
> > > [1, Table 1].
> > >
> > > Direct map removal does not reuse the `.prepare` machinery, since
> > > `prepare` can be called multiple times, and it is the responsibility
> > > of the preparation routine not to "prepare" the same folio twice [2].
> > > Instead, explicitly check whether `filemap_grab_folio` allocated a
> > > new folio, and remove the returned folio from the direct map only if
> > > this was the case.
> > >
> > > The patch uses release_folio instead of free_folio to reinsert pages
> > > back into the direct map because, by the time free_folio is called,
> > > folio->mapping can already be NULL. A call to folio_inode inside
> > > free_folio might therefore dereference a NULL pointer, leaving no way
> > > to access the inode which stores the flags that allow determining
> > > whether the page was removed from the direct map in the first place.
> > >
> > > [1]: https://download.vusec.net/papers/quarantine_raid23.pdf
> > >
> > > Cc: Patrick Roy <roypat@xxxxxxxxxxxx>
> > > Signed-off-by: Elliot Berman <quic_eberman@xxxxxxxxxxx>
> > > ---
> > >  include/linux/guest_memfd.h |  8 ++++++
> > >  mm/guest_memfd.c            | 65 ++++++++++++++++++++++++++++++++++++++++++++-
> > >  2 files changed, 72 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/guest_memfd.h b/include/linux/guest_memfd.h
> > > index be56d9d53067..f9e4a27aed67 100644
> > > --- a/include/linux/guest_memfd.h
> > > +++ b/include/linux/guest_memfd.h
> > > @@ -25,6 +25,14 @@ struct guest_memfd_operations {
> > >  	int (*release)(struct inode *inode);
> > >  };
> > >
> > > +/**
> > > + * @GUEST_MEMFD_FLAG_NO_DIRECT_MAP: When making folios inaccessible to the
> > > + * host, also remove them from the kernel's direct map.
> > > + */
> > > +enum {
> >
> > please name this enum, otherwise kernel-doc won't be happy
> >
> > > +	GUEST_MEMFD_FLAG_NO_DIRECT_MAP = BIT(0),
> > > +};
> > > +
> > >  /**
> > >   * @GUEST_MEMFD_GRAB_UPTODATE: Ensure pages are zeroed/up to date.
> > >   * If trusted hyp will do it, can omit this flag
> > >
> > > diff --git a/mm/guest_memfd.c b/mm/guest_memfd.c
> > > index 580138b0f9d4..e9d8cab72b28 100644
> > > --- a/mm/guest_memfd.c
> > > +++ b/mm/guest_memfd.c
> > > @@ -7,9 +7,55 @@
> > >  #include <linux/falloc.h>
> > >  #include <linux/guest_memfd.h>
> > >  #include <linux/pagemap.h>
> > > +#include <linux/set_memory.h>
> > > +
> > > +static inline int guest_memfd_folio_private(struct folio *folio)
> > > +{
> > > +	unsigned long nr_pages = folio_nr_pages(folio);
> > > +	unsigned long i;
> > > +	int r;
> > > +
> > > +	for (i = 0; i < nr_pages; i++) {
> > > +		struct page *page = folio_page(folio, i);
> > > +
> > > +		r = set_direct_map_invalid_noflush(page);
> > > +		if (r < 0)
> > > +			goto out_remap;
> > > +	}
> > > +
> > > +	folio_set_private(folio);
> > > +	return 0;
> > > +out_remap:
> > > +	for (; i > 0; i--) {
> > > +		struct page *page = folio_page(folio, i - 1);
> > > +
> > > +		BUG_ON(set_direct_map_default_noflush(page));
> > > +	}
> > > +	return r;
> > > +}
> > > +
> > > +static inline void guest_memfd_folio_clear_private(struct folio *folio)
> > > +{
> > > +	unsigned long start = (unsigned long)folio_address(folio);
> > > +	unsigned long nr = folio_nr_pages(folio);
> > > +	unsigned long i;
> > > +
> > > +	if (!folio_test_private(folio))
> > > +		return;
> > > +
> > > +	for (i = 0; i < nr; i++) {
> > > +		struct page *page = folio_page(folio, i);
> > > +
> > > +		BUG_ON(set_direct_map_default_noflush(page));
> > > +	}
> > > +	flush_tlb_kernel_range(start, start + folio_size(folio));
> >
> > I think that TLB flush should come after removing pages from the direct map
> > rather than after adding them back.
> >
>
> Gunyah flushes the TLB when it removes the stage 2 mapping, so we
> skipped it on removal as a performance optimization. I remember seeing
> that pKVM does the same (TLB flush for the stage 2 unmap & the
> equivalent for x86). Patrick had also done the same in their patches.

Strictly from the API perspective, unmapping the pages from the direct map
implies removing potentially stale TLB entries. If all currently anticipated
users do the flush elsewhere, at the very least there should be a huge bold
comment saying so. And what is the point of a TLB flush after setting the
direct map entries back to default? There should be no stale TLB entries
for the unmapped pages.

> Thanks,
> Elliot

--
Sincerely yours,
Mike.
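
For illustration, a minimal sketch (not part of the quoted patch) of the
ordering argued for above: flush the TLB once the direct-map entries have
been invalidated, and skip the flush when the default entries are restored,
since invalid entries are never cached. The helper names mirror the patch;
the exact include set and the surrounding guest_memfd plumbing are assumed.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/set_memory.h>
#include <asm/tlbflush.h>

/* Remove the folio from the direct map, flushing the TLB immediately. */
static int guest_memfd_folio_private(struct folio *folio)
{
	unsigned long start = (unsigned long)folio_address(folio);
	unsigned long nr_pages = folio_nr_pages(folio);
	unsigned long i;
	int r;

	for (i = 0; i < nr_pages; i++) {
		r = set_direct_map_invalid_noflush(folio_page(folio, i));
		if (r < 0)
			goto out_remap;
	}

	/*
	 * Flush here, while the range is being made inaccessible: this is
	 * the point where stale translations may still be cached.
	 */
	flush_tlb_kernel_range(start, start + folio_size(folio));

	folio_set_private(folio);
	return 0;

out_remap:
	/*
	 * Cached translations for the partially invalidated range still
	 * match the restored default mapping, so no flush is needed.
	 */
	for (; i > 0; i--)
		BUG_ON(set_direct_map_default_noflush(folio_page(folio, i - 1)));
	return r;
}

/* Reinsert the folio into the direct map; no flush is needed. */
static void guest_memfd_folio_clear_private(struct folio *folio)
{
	unsigned long i;

	if (!folio_test_private(folio))
		return;

	/*
	 * The entries being replaced were invalid and thus never cached,
	 * so there is nothing stale to flush when restoring the defaults.
	 */
	for (i = 0; i < folio_nr_pages(folio); i++)
		BUG_ON(set_direct_map_default_noflush(folio_page(folio, i)));

	folio_clear_private(folio);
}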