Re: s390-iommu.c default domain conversion

On 2022-05-20 16:17, Niklas Schnelle wrote:
On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote:
On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:

I did some testing and created a prototype that gets rid of
arch/s390/pci_dma.c and works solely via dma-iommu on top of our IOMMU
driver. It looks like the existing dma-iommu code allows us to do this
with relatively simple changes to the IOMMU driver only, mostly just
implementing iotlb_sync(), iotlb_sync_map() and flush_iotlb_all(), so
that's great. They also seem to map quite well to our RPCIT I/O TLB
flush. For now the prototype still uses 4k pages only.

You are going to want to improve the page sizes in the iommu driver
anyhow for VFIO.

Ok, we'll look into this.

With that the performance on the LPAR machine hypervisor (no paging) is
on par with our existing code. On paging hypervisors (z/VM and KVM)
i.e. with the hypervisor shadowing the I/O translation tables, it's
still slower than our existing code and interestingly strict mode seems
to be better than lazy here. One thing I haven't done yet is implement
the map_pages() operation or add larger page sizes.

map_pages() speeds things up if there is contiguous memory. I'm not
sure what workload you are testing with, so it is hard to guess if that
is interesting or not.

Our most important driver is mlx5 with both IP and RDMA traffic on
ConnectX-4/5/6 but we also support NVMes.

Since you already have the loop in s390_iommu_update_trans(), updating to map/unmap_pages should be trivial, and well worth it.
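
To make the suggestion above concrete, here is a minimal userspace sketch of how a per-page update loop could be wrapped in a map_pages()-style callback that takes a page size and a page count. Everything prefixed with demo_ or DEMO_ is a hypothetical stand-in for illustration only: the table layout, the valid bit and the call counter are assumptions, not the s390 driver's real data structures, and the kernel's actual callback also takes prot and gfp arguments that are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real s390 IOMMU driver types. */
#define DEMO_PAGE_SIZE  4096UL
#define DEMO_NR_ENTRIES 64

struct demo_domain {
	uint64_t table[DEMO_NR_ENTRIES]; /* one entry per 4k page, 0 = invalid */
	unsigned int calls;              /* number of map_pages invocations */
};

/* Per-page update, standing in for the body of the existing loop. */
static int demo_update_one(struct demo_domain *d, unsigned long iova,
			   uint64_t paddr)
{
	unsigned long idx = iova / DEMO_PAGE_SIZE;

	if (idx >= DEMO_NR_ENTRIES)
		return -1;
	d->table[idx] = paddr | 1; /* bit 0 marks the entry valid (assumption) */
	return 0;
}

/*
 * map_pages()-style entry point: the caller hands over a whole contiguous
 * run, so the driver is entered once instead of once per page.
 * *mapped reports how many bytes were mapped, even on partial failure.
 */
static int demo_map_pages(struct demo_domain *d, unsigned long iova,
			  uint64_t paddr, size_t pgsize, size_t pgcount,
			  size_t *mapped)
{
	size_t i;

	assert(mapped != NULL);
	*mapped = 0;
	d->calls++;

	if (pgsize != DEMO_PAGE_SIZE)
		return -1; /* this sketch only supports 4k pages */

	for (i = 0; i < pgcount; i++) {
		int rc = demo_update_one(d, iova + i * pgsize,
					 paddr + i * pgsize);
		if (rc)
			return rc;
		*mapped += pgsize;
	}
	return 0;
}
```

With this shape, a four-page contiguous buffer results in a single call into the driver rather than four, which is where the saving for contiguous memory would come from.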

Maybe you have some tips on what you'd expect to be most beneficial?
Either way we're optimistic this can be solved and this conversion
will be a high ranking item on my backlog going forward.

I'm not really sure I understand the differences, do you have a sense
what is making it slower? Maybe there is some small feature that can
be added to the core code? It is very strange that strict is faster,
that should not be, strict requires a synchronous flush in the unmap
case, lazy does not. Are you sure you are getting the lazy flushes
enabled?

The lazy flushes are the timer triggered flush_iotlb_all() in
fq_flush_iotlb(), right? I definitely see that when tracing my
flush_iotlb_all() implementation via that path. That flush_iotlb_all()
in my prototype is basically the same as the global RPCIT we did once
we wrapped around our IOVA address space. I suspect that this just
happens much more often with the timer than our wrap around and
flushing the entire aperture is somewhat slow because it causes the
hypervisor to re-examine the entire I/O translation table. On the other
hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap()
always flushes a relatively small contiguous range as I'm using the
following construct to extend gather:

	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_add_range(gather, iova, size);

Maybe the smaller contiguous ranges just help with locality/caching
because the flushed range in the guest's I/O tables was just updated.
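
The effect of that construct can be illustrated with a small userspace model. The sketch below is only an approximation of how the gather helpers behave as I understand them (an inclusive [start, end] range, with end == 0 meaning empty). All demo_ names, the flush counter and the byte counter are hypothetical and exist only to show that adjacent unmaps are merged while a disjoint unmap first flushes the small range collected so far.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model; demo_ names are hypothetical, not kernel API. */
struct demo_gather {
	unsigned long start;         /* first byte of the pending range */
	unsigned long end;           /* last byte, inclusive; 0 = empty */
	unsigned int flushes;        /* number of range flushes issued */
	unsigned long flushed_bytes; /* total bytes covered by flushes */
};

static void demo_gather_reset_range(struct demo_gather *g)
{
	g->start = ULONG_MAX;
	g->end = 0;
}

static void demo_gather_init(struct demo_gather *g)
{
	demo_gather_reset_range(g);
	g->flushes = 0;
	g->flushed_bytes = 0;
}

/* True if the new range neither overlaps nor touches the pending one. */
static bool demo_gather_is_disjoint(const struct demo_gather *g,
				    unsigned long iova, size_t size)
{
	unsigned long start = iova, end = start + size - 1;

	return g->end != 0 &&
	       (end + 1 < g->start || start > g->end + 1);
}

/* Grow the pending range so that it also covers [iova, iova + size). */
static void demo_gather_add_range(struct demo_gather *g,
				  unsigned long iova, size_t size)
{
	unsigned long end = iova + size - 1;

	if (g->start > iova)
		g->start = iova;
	if (g->end < end)
		g->end = end;
}

/* Stand-in for a ranged I/O TLB flush of the pending range. */
static void demo_sync(struct demo_gather *g)
{
	if (g->end != 0) {
		g->flushes++;
		g->flushed_bytes += g->end - g->start + 1;
	}
	demo_gather_reset_range(g);
}

/* The construct from the mail: flush first if the new range is disjoint. */
static void demo_unmap(struct demo_gather *g, unsigned long iova,
		       size_t size)
{
	if (demo_gather_is_disjoint(g, iova, size))
		demo_sync(g);
	demo_gather_add_range(g, iova, size);
}
```

In this model, unmapping 0x1000 and then 0x2000 (4k each) accumulates one 8k range without flushing, and a later unmap at 0x10000 flushes only those 8k before starting a new range, rather than invalidating the whole aperture.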

That's entirely believable - both the AMD and Intel drivers force strict mode when virtualised for similar reasons, so feel free to do the same.

Robin.


