RE: [PATCH v7 1/3] iommufd: Add data structure for Intel VT-d stage-1 cache invalidation

"Tian, Kevin" <kevin.tian@xxxxxxxxx> · Fri, 24 Nov 2023 03:00:45 +0000

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, November 22, 2023 9:26 PM
> 
> On Wed, Nov 22, 2023 at 04:58:24AM +0000, Tian, Kevin wrote:
> > then we just define hwpt 'cache' invalidation in vtd always refers to
> > both iotlb and devtlb. Then viommu just needs to call invalidation
> > uapi once when emulating virtual iotlb invalidation descriptor
> > while emulating the following devtlb invalidation descriptor
> > as a nop.
> 
> In principle ATC and IOMMU TLB invalidations should not always be
> linked.
> 
> Any scenario that allows devices to share an IOTLB cache tag requires
> fewer IOMMU TLB invalidations than ATC invalidations.

as long as the host iommu driver has the same knowledge then it will
always do the right thing.

e.g. one iotlb entry shared by 4 devices.

guest issues:
	1) iotlb invalidation
	2) devtlb invalidation for dev1
	3) devtlb invalidation for dev2
	4) devtlb invalidation for dev3
	5) devtlb invalidation for dev4

intel-viommu calls HWPT cache invalidation for 1) and treats 2-5) as nop.

intel-iommu driver internally knows the iotlb is shared by 4 devices (given
the same domain is attached to those devices) to handle HWPT
cache invalidation:

	1) iotlb invalidation
	2) devtlb invalidation for dev1
	3) devtlb invalidation for dev2
	4) devtlb invalidation for dev3
	5) devtlb invalidation for dev4

this is a good optimization, reducing 5 syscalls to 1, under the
assumption that the guest shouldn't expect any deterministic
behavior until 5) completes and brings iotlb/devtlbs back in sync.
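
to illustrate the flow above, here is a minimal user-space model in C.
It is only a sketch, not code from intel-iommu or QEMU: every name in it
(inv_desc, host_domain, host_hwpt_cache_invalidate, viommu_emulate) is
hypothetical, and the "hardware" flushes are just counters. It assumes
the host side can enumerate the devices attached to the domain.

```c
#include <stddef.h>

/* Hypothetical guest invalidation descriptor types (model only). */
enum desc_type { DESC_IOTLB_INV, DESC_DEVTLB_INV };

struct inv_desc {
	enum desc_type type;
	int dev_id;		/* only meaningful for DESC_DEVTLB_INV */
};

#define MAX_DEVS 8

/* Hypothetical host-side view of one domain and its attached devices. */
struct host_domain {
	int dev_ids[MAX_DEVS];
	size_t nr_devs;
	unsigned int hw_iotlb_flushes;	/* counts modeled iotlb flushes */
	unsigned int hw_devtlb_flushes;	/* counts modeled devtlb flushes */
};

/*
 * Host side: one "HWPT cache invalidation" request flushes the iotlb once
 * and then the devtlb of every device attached to the domain.
 */
static void host_hwpt_cache_invalidate(struct host_domain *dom)
{
	size_t i;

	dom->hw_iotlb_flushes++;
	for (i = 0; i < dom->nr_devs; i++)
		dom->hw_devtlb_flushes++;
}

/*
 * viommu side: forward only iotlb descriptors to the host and treat the
 * following devtlb descriptors as nops. Returns the number of host calls
 * made, i.e. the number of syscalls in the real flow.
 */
static unsigned int viommu_emulate(struct host_domain *dom,
				   const struct inv_desc *descs, size_t n)
{
	unsigned int host_calls = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (descs[i].type == DESC_IOTLB_INV) {
			host_hwpt_cache_invalidate(dom);
			host_calls++;
		}
		/* DESC_DEVTLB_INV: nop, already covered by the call above */
	}
	return host_calls;
}
```

feeding the five guest descriptors from the example (one iotlb plus four
devtlb) through viommu_emulate() with four devices attached results in a
single host call, while the model still performs one iotlb flush and
four devtlb flushes.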

another alternative is to have the guest batch 1-5) in one request, which
allows viommu to batch them in one invalidation call too. But
this is an orthogonal optimization in the guest which we don't want
to rely on.

> 
> I like the view of this invalidation interface as reflecting the
> actual HW and not trying to be smarter than real HW.

the guest-oriented interface, i.e. the viommu, reflects the HW.

the uAPI is more of a viommu-internal implementation detail. IMHO it's
not a bad thing to make it smarter as long as there is no
guest-observable breakage.

> 
> I'm fully expecting that Intel will adopt a direct-DMA flush queue
> like SMMU and AMD have already done as a performance optimization. In
> this world it makes no sense that the behavior of the direct DMA queue
> and driver mediated queue would be different.
> 

that's an orthogonal topic. I don't think the value of a direct-DMA flush
queue should prevent possible optimizations in the mediation path
(as long as guest-expected deterministic behavior is sustained).