Re: [RFC PATCH v2 12/22] iommufd: Allow mapping from guest_memfd

On Wed, Feb 19, 2025 at 04:37:08PM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 19, 2025 at 02:23:24PM -0600, Michael Roth wrote:
> > Just for clarity: at least for normal/nested page table (but I'm
> > assuming the same applies to IOMMU mappings), 1G mappings are
> > handled similarly as 2MB mappings as far as RMP table checks are
> > concerned: each 2MB range is checked individually as if it were
> > a separate 2MB mapping:
> 
> Well, IIRC we are dealing with the AMDv1 IO page table here which
> supports more sizes than 1G and we likely start to see things like 4M
> mappings and the like. So maybe there is some issue if the above
> special case really only applies to 1G and only 1G.

I think the documentation only mentioned 1G specifically since that's
the next level up in host/nested page table mappings, and that more
generally anything mapping at a higher granularity than 2MB would be
broken down into individual checks on each 2MB range within. But it's
quite possible things are handled differently for IOMMU so definitely
worth confirming.
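To make the documented behavior concrete, here is a minimal toy model (my own sketch, not kernel code; all names are made up) of how a mapping larger than 2MB is validated as if it were a series of independent 2MB checks against the RMP:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M (2ull * 1024 * 1024)

/* Toy model: one RMP state per 2MB chunk of a small address space. */
enum rmp_state { RMP_SHARED, RMP_PRIVATE_2M };
static enum rmp_state toy_rmp[8]; /* covers 16MB */

/* A mapping larger than 2MB passes only if every 2MB-aligned chunk
 * inside it would pass an individual 2MB RMP check. */
static bool rmp_check_mapping(uint64_t base, uint64_t size, bool private_access)
{
    for (uint64_t off = 0; off < size; off += SZ_2M) {
        enum rmp_state s = toy_rmp[(base + off) / SZ_2M];
        if (s != (private_access ? RMP_PRIVATE_2M : RMP_SHARED))
            return false;
    }
    return true;
}
```

So a 4M or 1G mapping would pass or fail chunk-by-chunk; whether the IOMMU path behaves identically is the open question above.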

> 
> > But the point still stands for 4K RMP entries and 2MB mappings: a 2MB
> > mapping either requires private page RMP entries to be 2MB, or in the
> > case of 2MB mapping of shared pages, every page in the range must be
> > shared according to the corresponding RMP entries.
> 
>  Is 4k RMP what people are running?

Unfortunately yes, but that's mainly due to guest_memfd only handling
4K currently. Hopefully that will change soon, but in the meantime
there's only experimental support for larger private page sizes that
make use of 2MB RMP entries (via THP).

But regardless, we'll still end up dealing with 4K RMP entries, since
we'll need to split 2MB RMP entries in response to private->shared
conversions that aren't 2MB-aligned/sized.
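As a rough illustration of when such a split is forced (hypothetical helpers, not kernel code): a 2MB RMP entry has to be broken into 4K entries whenever a conversion-range boundary falls inside the entry rather than on a 2MB edge:

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_2M 0x200000ull

/* A conversion range [start, end) forces a split of the 2MB RMP entry
 * containing 'start' (or 'end') whenever that boundary is not itself
 * 2MB-aligned. */
static bool needs_head_split(uint64_t start) { return start % SZ_2M != 0; }
static bool needs_tail_split(uint64_t end)   { return end % SZ_2M != 0; }
```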

> 
> > I think, for the non-SEV-TIO use-case, it had more to do with inability
> > to unmap a 4K range once a particular 4K page has been converted
> 
> Yes, we don't support unmap or resize. The entire theory of operation
> has the IOPTEs cover the guest memory and remain static at VM boot
> time. The RMP alone controls access and handles the static/private.
> 
> Assuming the host used 2M pages the IOPTEs in an AMDv1 table will be
> sized around 2M,4M,8M just based around random luck.
> 
> So it sounds like you can get to a situation with a >=2M mapping in
> the IOPTE but the guest has split it into private/shared at lower
> granularity and the HW cannot handle this?

Remembering more details: the situation is a bit more specific to
guest_memfd. In general, for non-SEV-TIO, everything in the IOMMU will
always be for shared pages, and because of that the RMP checks don't
impose any additional restrictions on mapping size: a shared page can
be mapped as 2MB even if the RMP entry is 4K, since the RMP page-size
bit only really applies to private pages.

The issue with guest_memfd is that it is only used for private pages
(at least until in-place conversion is supported), so when we "convert"
shared pages to private we are essentially discarding those pages and
re-allocating them via guest_memfd. The mappings for the discarded
pages then become stale and need to be removed. But since this can
happen at 4K granularity, we need to map at 4K, because we don't have
a way to split the mappings later on (at least, not currently...).

The other approach is to not discard these shared pages after
conversion and simply never free them back. That uses more host
memory, but allows for larger IOMMU mappings.

> 
> > from shared to private if it was originally installed via a 2MB IOPTE,
> > since the guest could actively be DMA'ing to other shared pages in
> > the 2M range (but we can be assured it is not DMA'ing to a particular 4K
> > page it has converted to private), and the IOMMU doesn't (AFAIK) have
> > a way to atomically split an existing 2MB IOPTE to avoid this. 
> 
> The iommu can split it (with SW help), I'm working on that
> infrastructure right now..
> 
> So you will get a notification that the guest has made a
> private/public split and the iommu page table can be atomically
> restructured to put an IOPTE boundary at the split.
> 
> Then the HW will not see IOPTEs that exceed the shared/private
> granularity of the VM.

That sounds very interesting. It would allow us to use larger IOMMU
mappings even for guest_memfd as it exists today, while still supporting
shared memory discard and avoiding the additional host memory usage
mentioned above. Are there patches available publicly?
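For reference on what such a restructure might look like (this is only my sketch of the idea, not Jason's actual infrastructure): since AMDv1 IOPTEs can use arbitrary power-of-two sizes, a 2MB IOPTE split at a 4K-aligned conversion boundary can be replaced by the largest naturally-aligned power-of-two blocks on each side of the split:

```c
#include <stddef.h>
#include <stdint.h>

/* Largest power of two <= x (returns 0 for x == 0). */
static uint64_t pow2_floor(uint64_t x)
{
    while (x & (x - 1))
        x &= x - 1;
    return x;
}

/* Decompose [base, end) into the largest naturally-aligned
 * power-of-two blocks, writing block sizes to out[]. Calling this
 * once per side of a split point yields replacement IOPTE sizes with
 * a boundary exactly at the split. */
static size_t decompose(uint64_t base, uint64_t end,
                        uint64_t out[], size_t max)
{
    size_t n = 0;
    while (base < end && n < max) {
        /* Block size is limited by both the alignment of 'base' and
         * the remaining length. */
        uint64_t align = base ? (base & -base) : UINT64_MAX;
        uint64_t size = pow2_floor(end - base);
        if (size > align)
            size = align;
        out[n++] = size;
        base += size;
    }
    return n;
}
```

E.g. splitting a 2MB IOPTE at offset 4K leaves a single 4K IOPTE on one side and a 4K/8K/.../1M ladder on the other, with no IOPTE crossing the shared/private boundary.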

Thanks,

Mike

> 
> Jason



