On Fri, 2022-05-20 at 12:56 -0300, Jason Gunthorpe wrote: > On Fri, May 20, 2022 at 05:17:05PM +0200, Niklas Schnelle wrote: > > > > > With that the performance on the LPAR machine hypervisor (no paging) is > > > > on par with our existing code. On paging hypervisors (z/VM and KVM) > > > > i.e. with the hypervisor shadowing the I/O translation tables, it's > > > > still slower than our existing code and interestingly strict mode seems > > > > to be better than lazy here. One thing I haven't done yet is implement > > > > the map_pages() operation or adding larger page sizes. > > > > > > map_pages() speeds thiings up if there is contiguous memory, I'm not > > > sure what work load you are testing with so hard to guess if that is > > > interesting or not. > > > > Our most important driver is mlx5 with both IP and RDMA traffic on > > ConnectX-4/5/6 but we also support NVMes. > > So you probably won't see big gains here from larger page sizes unless > you also have a specific userspace that is trigger huge pages. > > qemu users spaces do this so it is worth doing anyhow though. > > > > > Maybe you have some tips what you'd expect to be most beneficial? > > > > Either way we're optimistic this can be solved and this conversion > > > > will be a high ranking item on my backlog going forward. > > > > > > I'm not really sure I understand the differences, do you have a sense > > > what is making it slower? Maybe there is some small feature that can > > > be added to the core code? It is very strange that strict is faster, > > > that should not be, strict requires synchronous flush in the unmap > > > cas, lazy does not. Are you sure you are getting the lazy flushes > > > enabled? > > > > The lazy flushes are the timer triggered flush_iotlb_all() in > > fq_flush_iotlb(), right? I definitely see that when tracing my > > flush_iotlb_all() implementation via that path. That flush_iotlb_all() > > in my prototype is basically the same as the global RPCIT we did once > > we wrapped around our IOVA address space. I suspect that this just > > happens much more often with the timer than our wrap around and > > flushing the entire aperture is somewhat slow because it causes the > > hypervisor to re-examine the entire I/O translation table. On the other > > hand in strict mode the iommu_iotlb_sync() call in __iommu_unmap() > > always flushes a relatively small contiguous range as I'm using the > > following construct to extend gather: > > > > if (iommu_iotlb_gather_is_disjoint(gather, iova, size)) > > iommu_iotlb_sync(domain, gather); > > > > iommu_iotlb_gather_add_range(gather, iova, size); > > > > Maybe the smaller contiguous ranges just help with locality/caching > > because the flushed range in the guests I/O tables was just updated. > > So, from what I can tell, the S390 HW is not really the same as a > normal iommu in that you can do map over IOVA that hasn't been flushed > yet and the map will restore coherency to the new page table > entries. I see the zpci_refresh_trans() call in map which is why I > assume this? The zpci_refresh_trans() in map is only there because previously we didn't implement iotlb_sync_map(). Also, we only need to flush on map for the paged guest case so the hypervisor can update its shadow table. It happens unconditionally in the existing s390_iommu.c because that was not well optimized and uses the same s390_iommu_update_trans() for map and unmap. We had the skipping of the TLB flush handled properly in the arch/s390/pci_dma.c mapping code where !zdev->tlb_refresh indicates that we don't need flushes on map. > > (note that normal HW has a HW IOTLB cache that MUST be flushed or new > maps will not be loaded by the HW, so mapping to areas that previously > had uninvalidated IOVA is a functional problem, which motivates the > design of this scheme) We do need to flush the TLBs on unmap. The reason is that under LPAR (non paging hypervisor) the hardware can establish a new mapping on its own if an I/O PTE is changed from invalid to a valid translation and it wasn't previously in the TLB. I think that's how most hardware IOMMUs work and how I understand your explanation too. > > However, since S390 can restore coherency during map the lazy > invalidation is not for correctness but only for security - to > eventually unmap things that the DMA device should not be > touching? As explained above it is for correctness but with the existing code we handle this slightly differently. As we go through the entire IOVA space we're never reusing a previously unmapped IOVA until we run out of IOVAs. Then we do a global flush which on LPAR just drops the hardware TLBs making sure that future re-uses of IOVAs will trigger a harware walk of the I/O translation tables. Same constraints just a different scheme.