On Fri, 2022-05-20 at 10:44 -0300, Jason Gunthorpe wrote:
> On Fri, May 20, 2022 at 03:05:46PM +0200, Niklas Schnelle wrote:
> 
> > I did some testing and created a prototype that gets rid of
> > arch/s390/pci_dma.c and works solely via dma-iommu on top of our
> > IOMMU driver. It looks like the existing dma-iommu code allows us to
> > do this with relatively simple changes to the IOMMU driver only,
> > mostly just implementing iotlb_sync(), iotlb_sync_map() and
> > flush_iotlb_all(), so that's great. They also seem to map quite well
> > to our RPCIT I/O TLB flush, which is also great. For now the
> > prototype still uses 4k pages only.
> 
> You are going to want to improve the page sizes in the iommu driver
> anyhow for VFIO.

Ok, we'll look into this.

> > With that, the performance on the LPAR machine hypervisor (no
> > paging) is on par with our existing code. On paging hypervisors
> > (z/VM and KVM), i.e. with the hypervisor shadowing the I/O
> > translation tables, it's still slower than our existing code, and
> > interestingly strict mode seems to be better than lazy here. One
> > thing I haven't done yet is implement the map_pages() operation or
> > add larger page sizes.
> 
> map_pages() speeds things up if there is contiguous memory. I'm not
> sure what workload you are testing with, so it's hard to guess if
> that is interesting or not.

Our most important driver is mlx5 with both IP and RDMA traffic on
ConnectX-4/5/6, but we also support NVMes.

> > Maybe you have some tips on what you'd expect to be most beneficial?
> > Either way we're optimistic this can be solved and this conversion
> > will be a high-ranking item on my backlog going forward.
> 
> I'm not really sure I understand the differences, do you have a sense
> of what is making it slower? Maybe there is some small feature that
> can be added to the core code? It is very strange that strict is
> faster, that should not be; strict requires a synchronous flush in
> the unmap case, lazy does not. Are you sure you are getting the lazy
> flushes enabled?

The lazy flushes are the timer-triggered flush_iotlb_all() in
fq_flush_iotlb(), right? I definitely see that when tracing my
flush_iotlb_all() implementation via that path. That flush_iotlb_all()
in my prototype is basically the same as the global RPCIT we did once
we wrapped around our IOVA address space. I suspect that this just
happens much more often with the timer than with our wrap-around, and
flushing the entire aperture is somewhat slow because it causes the
hypervisor to re-examine the entire I/O translation table.

On the other hand, in strict mode the iommu_iotlb_sync() call in
__iommu_unmap() always flushes a relatively small contiguous range, as
I'm using the following construct to extend the gather:

	if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
		iommu_iotlb_sync(domain, gather);

	iommu_iotlb_gather_add_range(gather, iova, size);

Maybe the smaller contiguous ranges just help with locality/caching
because the flushed range in the guest's I/O tables was just updated.

> > I also stumbled over the following patch series which I think would
> > also help our paging hypervisor cases a lot since it should
> > alleviate the cost of shadowing short-lived mappings:
> 
> This is quite different than what your current code does though?

Yes

> > Still, it seems encouraging
> 
> Jason
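
For illustration, here is a minimal sketch of how the three flush
callbacks discussed in the mail could be wired up in an s390-style IOMMU
driver as thin wrappers around a single range flush. This is not the
actual prototype code: zpci_flush_range() is a hypothetical stand-in for
the real RPCIT invocation, and the struct s390_domain layout is assumed
for the example only.

	#include <linux/iommu.h>

	struct s390_domain {
		struct iommu_domain	domain;
		dma_addr_t		aperture_start;
		dma_addr_t		aperture_end;
	};

	/* hypothetical helper: refresh the I/O TLB for [iova, iova + size) */
	static void zpci_flush_range(struct s390_domain *s390_dom,
				     unsigned long iova, size_t size)
	{
		/* would issue RPCIT for the range on each attached device */
	}

	static void s390_flush_iotlb_all(struct iommu_domain *domain)
	{
		struct s390_domain *s390_dom =
			container_of(domain, struct s390_domain, domain);

		/* lazy path: flush the whole aperture, like the old wrap-around RPCIT */
		zpci_flush_range(s390_dom, s390_dom->aperture_start,
				 s390_dom->aperture_end - s390_dom->aperture_start);
	}

	static void s390_iotlb_sync(struct iommu_domain *domain,
				    struct iommu_iotlb_gather *gather)
	{
		struct s390_domain *s390_dom =
			container_of(domain, struct s390_domain, domain);

		/* nothing was gathered, nothing to flush */
		if (!gather->end)
			return;

		/* strict path: flush only the gathered contiguous range */
		zpci_flush_range(s390_dom, gather->start,
				 gather->end - gather->start + 1);
	}

	static void s390_iotlb_sync_map(struct iommu_domain *domain,
					unsigned long iova, size_t size)
	{
		struct s390_domain *s390_dom =
			container_of(domain, struct s390_domain, domain);

		/* paging hypervisors also need to see newly created mappings */
		zpci_flush_range(s390_dom, iova, size);
	}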
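
Likewise, a hedged sketch of where the gather-extension construct quoted
in the mail would typically sit, assuming a driver-level unmap callback
with the upstream signature of that era; s390_clear_translation() is a
made-up placeholder for invalidating the I/O translation table entries.

	/* hypothetical helper: clear the translation table entries for the range */
	static int s390_clear_translation(struct iommu_domain *domain,
					  unsigned long iova, size_t size)
	{
		return 0; /* placeholder: real code would invalidate the entries */
	}

	static size_t s390_iommu_unmap(struct iommu_domain *domain,
				       unsigned long iova, size_t size,
				       struct iommu_iotlb_gather *gather)
	{
		if (s390_clear_translation(domain, iova, size))
			return 0;

		/*
		 * If this range is not contiguous with what was gathered so
		 * far, flush the gathered range now so the gather always
		 * describes a single contiguous region, keeping each RPCIT
		 * small.
		 */
		if (iommu_iotlb_gather_is_disjoint(gather, iova, size))
			iommu_iotlb_sync(domain, gather);

		iommu_iotlb_gather_add_range(gather, iova, size);

		return size;
	}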