On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
> For AMD IOMMU there is a feature called not-present cache. It says that
> the IOMMU caches non-present entries as well and needs an IOTLB flush
> when something is mapped (meant for software implementations of the
> IOMMU).
> So it can't be really taken out of the fast-path. But the IOMMU driver
> can optimize the function so that it only flushes the IOTLB when there
> was an unmap-call before.

We have exactly the same situation with the Intel IOMMU (we call it
'Caching Mode') for the same reasons.

I'd be wary about making the IOMMU driver *track* whether there was an
unmap call before -- that seems like hard work and more cache contention,
especially if the ->commit() call happens on a CPU other than the one
that just did the unmap.

I'm also not sure exactly when you'd call the ->commit() function when
the DMA API is being used, and on which 'side' of that API the
deferred-flush optimisations would live.

Would the optimisation be done on the generic side, only calling
->commit() when it absolutely *has* to happen? (Or periodically after
unmaps have happened, to avoid entries hanging around for ever?) Or
would the optimisation be done in the IOMMU driver, thus turning the
->commit() function into more of a *hint*? You could add a 'simon_says'
boolean argument to it, I suppose...?

> It is also an improvement over the current situation where every
> iommu_unmap call results in a flush implicitly. This is pretty much a
> no-go for using IOMMU-API in DMA mapping at the moment.

Right. That definitely needs to be handled. We just need to work out
the (above and other) details.

> > But also, it's not *so* much of an issue to divide the space up even
> > when it's limited. The idea was not to have it *strictly* per-CPU, but
> > just for a CPU to try allocating from "its own" subrange first…
>
> Yeah, I get the idea. I fear that the memory consumption will get pretty
> high with that approach.
> It basically means one round-robin allocator per cpu and device. What
> does that mean on a 4096 CPU machine :)

Well, if your network device is taking interrupts, and mapping/unmapping
buffers, across all 4096 CPUs, then your performance is screwed anyway :)

Certainly your concerns are valid, but I think we can cope with them
fairly reasonably. If we *do* have a large number of CPUs allocating for
a given domain, we can move to a per-node rather than per-CPU allocator.
And we can have dynamically sized allocation regions, so we aren't
wasting too much space on unused bitmaps if you map just *one* page from
each of your 4096 CPUs.

> How much lock contention will be lowered also depends on the work-load.
> If dma-handles are frequently freed from another cpu than they were
> allocated from, the same problem re-appears.

The idea is that dma handles are *infrequently* freed, in batches. So
we'll bounce the lock's cache line occasionally, but not all the time.

In "strict" or "unmap_flush" mode, you get to go slowly unless you do
the unmap on the same CPU that you mapped it from. I can live with that.

> But in the end we have to try it out and see what works best :)

Indeed. I'm just trying to work out whether I should try to do the
allocator thing purely inside the Intel code first, and then try to move
it out and make it generic -- or whether I should start with making the
DMA API work with a wrapper around the IOMMU API, with your ->commit()
and other necessary changes. I think I'd prefer the latter, if we can
work out how it should look.

-- 
dwmw2
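To make the "only flush if there was an unmap-call before" idea above
concrete, here is a minimal user-space sketch. All names (demo_*) are
illustrative, not real IOMMU-API symbols, and the 'force' flag stands in
for the suggested 'simon_says' argument: when false, ->commit() is
treated as a hint and the flush is skipped if nothing was unmapped since
the last one.

```c
#include <stdbool.h>

/* Hypothetical per-domain state: remember whether an unmap has
 * happened since the last IOTLB flush, so commit can be cheap when
 * there is nothing to invalidate. */
struct demo_domain {
	bool unmapped_since_flush;	/* set by unmap, cleared by flush */
	unsigned long flush_count;	/* how many real flushes were issued */
};

static void demo_flush_iotlb(struct demo_domain *dom)
{
	/* stand-in for the actual hardware IOTLB invalidation */
	dom->flush_count++;
	dom->unmapped_since_flush = false;
}

static void demo_unmap(struct demo_domain *dom)
{
	/* ... page-table entries would be cleared here ... */
	dom->unmapped_since_flush = true;
}

/* 'force' plays the role of the 'simon_says' boolean: flush
 * unconditionally when true, otherwise only if an unmap happened. */
static void demo_commit(struct demo_domain *dom, bool force)
{
	if (force || dom->unmapped_since_flush)
		demo_flush_iotlb(dom);
}
```

Note this sketch sidesteps the cross-CPU concern raised above: if unmap
and commit run on different CPUs, the flag itself becomes a point of
cache contention.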
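The "try your own subrange first" allocator idea could look roughly like
the following toy sketch. The names, the fixed bitmap, and the tiny
sizes are all illustrative assumptions; a real IOVA allocator would use
dynamically sized regions (as suggested above) and per-node rather than
per-CPU subranges on large machines.

```c
#include <stdbool.h>

#define NR_CPUS		4
#define SLOTS_PER_CPU	8
#define TOTAL_SLOTS	(NR_CPUS * SLOTS_PER_CPU)

/* One round-robin cursor per CPU over a shared slot bitmap; a CPU
 * prefers its own subrange and only falls back to the others when
 * its own is exhausted. */
struct demo_iova_space {
	bool used[TOTAL_SLOTS];
	unsigned int next[NR_CPUS];	/* round-robin cursor per CPU */
};

/* Allocate one slot, preferring @cpu's own subrange.
 * Returns the slot index, or -1 if the whole space is full. */
static int demo_alloc(struct demo_iova_space *s, unsigned int cpu)
{
	unsigned int c, i;

	for (c = 0; c < NR_CPUS; c++) {
		unsigned int home = (cpu + c) % NR_CPUS;
		unsigned int base = home * SLOTS_PER_CPU;

		for (i = 0; i < SLOTS_PER_CPU; i++) {
			unsigned int slot =
				base + (s->next[home] + i) % SLOTS_PER_CPU;
			if (!s->used[slot]) {
				s->used[slot] = true;
				s->next[home] =
					(slot - base + 1) % SLOTS_PER_CPU;
				return (int)slot;
			}
		}
	}
	return -1;
}

static void demo_free(struct demo_iova_space *s, int slot)
{
	s->used[slot] = false;
}
```

The point of the layout is that in the common case each CPU touches only
its own cursor and its own part of the bitmap, so allocation does not
bounce cache lines between CPUs; contention only appears on the
fallback path, or when frees happen on a different CPU than the
matching allocation, which is exactly the workload concern quoted above.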