On Sat, Dec 02, 2023, Yan Zhao wrote:
> This RFC series proposes a framework to resolve IOPF by sharing the KVM
> TDP (Two Dimensional Paging) page table with the IOMMU as its stage 2
> paging structure, to support IOPF (IO page fault) on the IOMMU's stage 2
> paging structure.
>
> Previously, all guest pages had to be pinned and mapped in IOMMU stage 2
> paging structures after pass-through devices were attached, even if the
> device has IOPF capability.  Such all-guest-memory pinning can be avoided
> when IOPF handling for the stage 2 paging structure is supported, and if
> only IOPF-capable devices are attached to a VM.
>
> There are 2 approaches to support IOPF on IOMMU stage 2 paging structures:
>
> - Support by IOMMUFD/IOMMU alone
>   IOMMUFD handles IO page faults on the stage-2 HWPT by calling GUP and
>   then iommu_map() to set up IOVA mappings.  (An IOAS is required to keep
>   the GPA-to-HVA info, but page pinning/unpinning needs to be skipped.)
>   Then, upon MMU notifier events on the host primary MMU, iommu_unmap()
>   is called to adjust IOVA mappings accordingly.
>   The IOMMU driver needs to support unmapping sub-ranges of a previously
>   mapped range, and to take care of huge page merge and split in an
>   atomic way. [1][2]
>
> - Sharing KVM TDP
>   IOMMUFD sets the root of the KVM TDP page table (EPT/NPT on x86) as
>   the root of the IOMMU stage 2 paging structure, and routes IO page
>   faults to KVM.  (This assumes that the IOMMU hw supports the same
>   stage-2 page table format as the CPU.)
>   In this model the page table is centrally managed by KVM (MMU notifier,
>   page mapping, subpage unmapping, atomic huge page split/merge, etc.),
>   while IOMMUFD only needs to invalidate the IOTLB/devTLB properly.

There are more approaches beyond having IOMMUFD and KVM be completely
separate entities.  E.g. extract the bulk of KVM's "TDP MMU" implementation
to common code so that IOMMUFD doesn't need to reinvent the wheel.

> Currently, there's no upstream code available to support stage 2 IOPF yet.
>
> This RFC chooses to implement the "Sharing KVM TDP" approach, which has
> the below main benefits:

Please list out the pros and cons for each.  In the cons column for
piggybacking KVM's page tables:

 - *Significantly* increases the complexity in KVM
 - Puts constraints on what KVM can/can't do in the future (see the
   movement of SPTE_MMU_PRESENT).
 - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot
   deletion mess, the truly nasty MTRR emulation (which I still hope to
   delete), the NX hugepage mitigation, etc.

Please also explain the intended/expected/targeted use cases.  E.g. if
the main use case is device passthrough to slice-of-hardware VMs that
aren't memory oversubscribed, then state that explicitly, as it changes
which tradeoffs are acceptable.

> - Unified page table management
>   The complexity of allocating guest pages per GPA, registering MMU
>   notifiers on the host primary MMU, sub-page unmapping, and atomic page
>   merge/split

Please find different terminology than "sub-page".  With Sub-Page
Protection, Intel has more or less established "sub-page" to mean "less
than 4KiB granularity".  But that can't possibly be what you mean here,
because KVM doesn't support (un)mapping memory at <4KiB granularity.
Based on the context above, I assume you mean "unmapping arbitrary pages
within a given range".

>   are only required to be handled on the KVM side, which has been doing
>   that well for a long time.
>
> - Reduced page faults:
>   Only one page fault is triggered on a single GPA, whether caused by IO
>   access or by vCPU access.  (Compared to one IO page fault for DMA plus
>   one CPU page fault for vCPUs in the non-shared approach.)

This would be relatively easy to solve with bi-directional notifiers,
i.e. KVM notifies IOMMUFD when a vCPU faults in a page, and vice versa.
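Very roughly, and purely as a sketch -- nothing below exists today, all
of the struct and callback names are made up -- the glue could look
something like:

  /*
   * Hypothetical bi-directional notifier: each side registers a set of
   * callbacks with the other, so that a fault resolved by one consumer
   * can be proactively mapped into the other's page tables instead of
   * taking a second, redundant fault on the same GPA.
   */
  struct kvm_iommu_map_notifier {
          /* KVM -> IOMMUFD: a vCPU fault just mapped @gfn at @level. */
          void (*cpu_fault_mapped)(struct kvm_iommu_map_notifier *n,
                                   u64 gfn, u64 pfn, int level);

          /* IOMMUFD -> KVM: an IO page fault just mapped @iova. */
          void (*io_fault_mapped)(struct kvm_iommu_map_notifier *n,
                                  u64 iova, u64 paddr, u64 size);
  };

KVM would invoke ->cpu_fault_mapped() at the tail of its page fault
handler, IOMMUFD would invoke ->io_fault_mapped() from its IOPF path,
and the receiving side would pre-fault the same range into its own
tables.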
> - Reduced memory consumption:
>   Memory of one page table is saved.

I'm not convinced that memory consumption is all that interesting.  If a
VM is mapping the majority of its memory into a device, then odds are
good that the guest is backed with at least 2MiB pages, if not 1GiB
pages, at which point the memory overhead for page tables is quite small,
especially relative to the total memory overhead of such systems.

If a VM is mapping only a small subset of its memory into devices, then
the IOMMU page tables should be sparsely populated, i.e. they won't
consume much memory.
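To put rough numbers on that (back-of-the-envelope, assuming x86-64
style 4-level paging with 8-byte entries and 512 entries per 4KiB
table): with 4KiB mappings, the leaf tables cost 8 bytes per 4KiB
mapped, i.e. ~1/512th of guest memory, or ~256MiB of page tables for a
fully-mapped 128GiB VM.  With 2MiB mappings, the leaf level disappears
and the cost drops by another 512x, to ~512KiB for that same VM.  So a
duplicated set of IOMMU page tables is ~0.2% overhead in the absolute
worst case, and is noise for hugepage-backed VMs.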