On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each.  In the cons column for piggybacking
> KVM's page tables:
>
>  - *Significantly* increases the complexity in KVM

The complexity to KVM (up to now) is:
a. faulting in non-vCPU context
b. keeping the exported root always "active"
c. disallowing non-coherent DMAs
d. the movement of SPTE_MMU_PRESENT

For a, I think it's accepted, and we can see that eager page split already
allocates non-leaf pages in non-vCPU context.

For b, it requires the exported TDP root to stay "active" across KVM's "fast
zap" (which invalidates all active TDP roots). Instead, all leaf entries of
the exported TDP are zapped. Though that looks not "fast" enough, it avoids
an unnecessary root page zap, and fast zap is actually not frequent:
- once for memslot removal (an IO page fault is unlikely to happen during
  VM boot-up)
- once for MMIO generation wraparound (which is rare)
- once for NX huge page mode changes (which is rare too)

For c, maybe we can work out a way to remove the MTRR stuff.

For d, I added a config to turn this movement on/off. But right, when the
config is on, the KVM side will have to sacrifice a bit reserved for
software usage and take care of it.

>  - Puts constraints on what KVM can/can't do in the future (see the movement
>    of SPTE_MMU_PRESENT).
>  - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
>    mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
>    hugepage mitigation, etc.

The NX hugepage mitigation only exists on certain CPUs; I don't see it on
recent Intel platforms, e.g. SPR and GNR. We can disallow the sharing
approach if the NX huge page mitigation is enabled. Even then, if pinning or
partial pinning is not involved, NX huge pages will only cause unnecessary
zaps that reduce performance; functionally it still works well.

Besides, for the extra IO invalidation involved in a TDP zap, I think SVM
has the same issue, i.e. each zap in the primary MMU is also accompanied by
an IO invalidation.

> Please also explain the intended/expected/targeted use cases.  E.g. if the main
> use case is for device passthrough to slice-of-hardware VMs that aren't memory
> oversubscribed,

The main use case is device passthrough with all devices supporting full
IOPF.

Opportunistically, we hope it can be used in trusted IO, where the TDP is
shared to the IO side. Then only one page table audit is required, and the
out-of-sync window for mappings between the CPU and IO sides can also be
eliminated.

> > - Unified page table management
> >   The complexity of allocating guest pages per GPAs, registering to MMU
> >   notifier on host primary MMU, sub-page unmapping, atomic page merge/split
>
> Please find different terminology than "sub-page".  With Sub-Page Protection, Intel
> has more or less established "sub-page" to mean "less than 4KiB granularity".  But
> that can't possibly what you mean here because KVM doesn't support (un)mapping
> memory at <4KiB granularity.  Based on context above, I assume you mean "unmapping
> arbitrary pages within a given range".

Ok, sorry for this confusion. By "sub-page unmapping", I mean atomically
splitting a huge page and then unmapping a smaller range within the previous
huge page.