On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things. We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
>
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE. All
else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)

> This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock"). There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
>
> Omer's paper
> (https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
> has some promising approaches. The magazine avoids the RB tree issue.

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer-granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere: avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.

Jörg has been looking at this too, and was even trying to find space
in the PTE for a use count, so a given page could be in more than one
mapping before we fall back to the IOVA allocator.
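To make that concrete, the fast path I have in mind is roughly the
sketch below. It's purely illustrative, not real intel-iommu code:
pte_slot_for() and alloc_iova_above_ram() are placeholder names for
whatever we'd actually end up using.

/*
 * Illustrative sketch only, not real intel-iommu code.
 * pte_slot_for() and alloc_iova_above_ram() are hypothetical
 * placeholders; the point is the cmpxchg-on-the-PTE fast path with a
 * fall-back to the regular IOVA allocator.
 */
static dma_addr_t dyn_identity_map(struct dmar_domain *domain,
                                   phys_addr_t paddr, u64 prot)
{
        u64 *pte = pte_slot_for(domain, paddr); /* slot where IOVA == paddr */
        u64 new = (paddr & PAGE_MASK) | prot;

        /* Fast path: claim the identity slot with a single cmpxchg. */
        if (cmpxchg64(pte, 0, new) == 0)
                return (dma_addr_t)paddr;

        /*
         * Slot already in use (or we lost the race): fall back to the
         * IOVA allocator, configured to hand out addresses above the
         * top of RAM so they never collide with identity mappings.
         */
        return alloc_iova_above_ram(domain, paddr, prot);
}

With Jörg's use-count idea, the cmpxchg would instead bump a counter
held in spare PTE bits, and we'd only fall back to the allocator once
that counter saturates.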
> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine. (This is worthy
> of much discussion.)

We already do something like this for page table pages which are freed
by an unmap, FWIW.

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kinds of
> workloads use an IOVA for one event and then never come back. Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at, although it would
be nice if we had a better way to measure the performance impact of
IOTLB misses; currently we don't have a lot of visibility at all.

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses. We tested using leaf nodes of size 4KB,
> of 2MB, and of 1GB. In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations. We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses. At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.

Was this with ATS on or off? With ATS, the cost of the page walk can
be amortised in some cases: you can look up the physical address
*before* you are ready to actually start the DMA to it, so you don't
take that latency at the time you're actually moving the data.
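(As an aside, recreating that kind of static 1:1 map through the
generic IOMMU API would look roughly like the sketch below. It assumes
a domain has already been allocated and attached to the device; the
IOMMU core will split the mapping into smaller leaves if the hardware
can't do 1GiB pages.)

/*
 * Rough sketch, not taken from the measurements in [1]: statically
 * identity-mapping a physical range so that IOVA == physical address.
 * 'start' and 'end' are assumed 1GiB-aligned for simplicity.
 */
#include <linux/iommu.h>
#include <linux/sizes.h>

static int identity_map_range(struct iommu_domain *domain,
                              phys_addr_t start, phys_addr_t end)
{
        phys_addr_t addr;
        int ret;

        for (addr = start; addr < end; addr += SZ_1G) {
                /*
                 * IOVA == physical address.  The IOMMU core splits
                 * each chunk into the largest leaf sizes the hardware
                 * supports (1GiB here, else 2MiB/4KiB).
                 */
                ret = iommu_map(domain, addr, addr, SZ_1G,
                                IOMMU_READ | IOMMU_WRITE);
                if (ret)
                        return ret;
        }
        return 0;
}

Once the whole range is mapped there is zero runtime page-table
maintenance and no per-DMA invalidation, matching the conditions
described in [1].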
--
David Woodhouse                            Open Source Technology Centre
David.Woodhouse@xxxxxxxxx                              Intel Corporation