On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> There are more approaches beyond having IOMMUFD and KVM be
> completely separate entities. E.g. extract the bulk of KVM's "TDP
> MMU" implementation to common code so that IOMMUFD doesn't need to
> reinvent the wheel.

We've pretty much done this already, it is called "hmm" and it is what
the IO world uses.

Merging/splitting huge pages is just something that needs some coding
in the page table code, and people want it for other reasons anyhow.

> - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
>   mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
>   hugepage mitigation, etc.

Does it? I think that just remains isolated in kvm. The output from
KVM is only a radix table top pointer; it is still up to KVM how to
manage it.

> I'm not convinced that memory consumption is all that interesting. If a VM is
> mapping the majority of memory into a device, then odds are good that the guest
> is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory
> overhead for page tables is quite small, especially relative to the total amount
> of memory overheads for such systems.

AFAIK the main argument is performance. It is similar to why we want
to do IOMMU SVA with MM page table sharing.

If the IOMMU mirrors/shadows/copies a page table using something like
HMM techniques, then invalidations will mark ranges of IOVA as
non-present and faults will occur to trigger hmm_range_fault() to do
the shadowing (a rough sketch of that pattern is appended below).

This means that pretty much all IO will encounter a non-present fault,
certainly at the start and possibly repeatedly while ongoing.

On the other hand, if we share the exact page table then natural CPU
touches will usually make the page present before an IO happens in
almost all cases, and we don't have to take the horribly expensive IO
page fault at all.

We were not able to make bi-directional notifiers work with the CPU
mm; I'm not sure that is "relatively easy" :(

Jason
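
For reference, here is a rough sketch of that fault-driven mirroring
pattern, loosely following the usage described in
Documentation/mm/hmm.rst. The my_* names, the device page-table update
steps, and the pt_lock are placeholders for illustration, not a real
driver:

/*
 * Sketch: mirror a CPU address range into a device page table with
 * hmm_range_fault().  The mmu_interval_notifier is assumed to have
 * been registered with mmu_interval_notifier_insert().
 */
#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mm.h>
#include <linux/mutex.h>
#include <linux/sched/mm.h>

struct my_mirror {
	struct mmu_interval_notifier notifier;
	struct mutex pt_lock;		/* protects the device page table */
};

/*
 * CPU-side invalidation: the range goes non-present in the device
 * table, so the next DMA to it takes an IO page fault.
 */
static bool my_mirror_invalidate(struct mmu_interval_notifier *mni,
				 const struct mmu_notifier_range *range,
				 unsigned long cur_seq)
{
	struct my_mirror *mirror = container_of(mni, struct my_mirror,
						notifier);

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&mirror->pt_lock);
	else if (!mutex_trylock(&mirror->pt_lock))
		return false;

	mmu_interval_set_seq(mni, cur_seq);
	/* ... clear device PTEs for [range->start, range->end) ... */
	mutex_unlock(&mirror->pt_lock);
	return true;
}

static const struct mmu_interval_notifier_ops my_mirror_ops = {
	.invalidate = my_mirror_invalidate,
};

/* IO page fault side: re-populate the shadow for one faulting range. */
static int my_mirror_fault(struct my_mirror *mirror, unsigned long start,
			   unsigned long end, unsigned long *pfns)
{
	struct mm_struct *mm = mirror->notifier.mm;
	struct hmm_range range = {
		.notifier	= &mirror->notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	if (!mmget_not_zero(mm))
		return -EFAULT;
again:
	range.notifier_seq = mmu_interval_read_begin(&mirror->notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		goto out;
	}

	mutex_lock(&mirror->pt_lock);
	if (mmu_interval_read_retry(&mirror->notifier, range.notifier_seq)) {
		mutex_unlock(&mirror->pt_lock);
		goto again;
	}
	/* ... write pfns[] into the device page table ... */
	mutex_unlock(&mirror->pt_lock);
out:
	mmput(mm);
	return ret;
}

The retry loop against mmu_interval_read_retry() is what keeps the
shadow coherent: every CPU-side invalidation forces the next DMA to
that range back through this path, which is the IO page fault cost
being contrasted with sharing the CPU page table directly.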