On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
> For AMD IOMMU there is a feature called not-present cache. It says that
> the IOMMU caches non-present entries as well and needs an IOTLB flush
> when something is mapped (meant for software implementations of the
> IOMMU).
> So it can't be really taken out of the fast-path. But the IOMMU driver
> can optimize the function so that it only flushes the IOTLB when there
> was an unmap-call before.

We have exactly the same situation with the Intel IOMMU (we call it
'Caching Mode') for the same reasons.

I'd be wary about making the IOMMU driver *track* whether there was an
unmap call before -- that seems like hard work and more cache contention,
especially if the ->commit() call happens on a CPU other than the one
that just did the unmap.

I'm also not sure exactly when you'd call the ->commit() function when
the DMA API is being used, and on which 'side' of that API the
deferred-flush optimisations would live.

Would the optimisation be done on the generic side, only calling
->commit() when it absolutely *has* to happen? (Or periodically after
unmaps have happened, to avoid entries hanging around for ever?) Or
would the optimisation be done in the IOMMU driver, thus turning the
->commit() function into more of a *hint*? You could add a 'simon_says'
boolean argument to it, I suppose...?

> It is also an improvement over the current situation where every
> iommu_unmap call results in a flush implicitly. This is pretty much a
> no-go for using IOMMU-API in DMA mapping at the moment.

Right. That definitely needs to be handled. We just need to work out
the (above and other) details.

> > But also, it's not *so* much of an issue to divide the space up even
> > when it's limited. The idea was not to have it *strictly* per-CPU, but
> > just for a CPU to try allocating from "its own" subrange first…
>
> Yeah, I get the idea. I fear that the memory consumption will get pretty
> high with that approach.
> It basically means one round-robin allocator per cpu and device. What
> does that mean on a 4096 CPU machine :)

Well, if your network device is taking interrupts, and mapping/unmapping
buffers, across all 4096 CPUs, then your performance is screwed anyway :)

Certainly your concerns are valid, but I think we can cope with them
fairly reasonably. If we *do* have a large number of CPUs allocating for
a given domain, we can move to a per-node rather than per-CPU allocator.
And we can have dynamically sized allocation regions, so we aren't
wasting too much space on unused bitmaps if you map just *one* page from
each of your 4096 CPUs.

> How much lock contention will be lowered also depends on the work-load.
> If dma-handles are frequently freed from another cpu than they were
> allocated from, the same problem re-appears.

The idea is that dma handles are *infrequently* freed, in batches. So
we'll bounce the lock's cache line occasionally, but not all the time.

In "strict" or "unmap_flush" mode, you get to go slowly unless you do
the unmap on the same CPU that you mapped it from. I can live with that.

> But in the end we have to try it out and see what works best :)

Indeed. I'm just trying to work out whether I should try to do the
allocator thing purely inside the Intel code first, and then try to move
it out and make it generic -- or whether I should start with making the
DMA API work with a wrapper around the IOMMU API, with your ->commit()
and other necessary changes. I think I'd prefer the latter, if we can
work out how it should look.

-- 
dwmw2
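To make the "only flush if there was an unmap-call before" idea above
concrete, here is a minimal user-space sketch. All names (demo_*) are
illustrative, not real IOMMU-API symbols, and the 'force' flag stands in
for the suggested 'simon_says' argument: when false, ->commit() is
treated as a hint and the flush is skipped if nothing was unmapped since
the last one.

```c
#include <stdbool.h>

/* Hypothetical per-domain state: remember whether an unmap has
 * happened since the last IOTLB flush, so commit can be cheap when
 * there is nothing to invalidate. */
struct demo_domain {
	bool unmapped_since_flush;	/* set by unmap, cleared by flush */
	unsigned long flush_count;	/* how many real flushes were issued */
};

static void demo_flush_iotlb(struct demo_domain *dom)
{
	/* stand-in for the actual hardware IOTLB invalidation */
	dom->flush_count++;
	dom->unmapped_since_flush = false;
}

static void demo_unmap(struct demo_domain *dom)
{
	/* ... page-table entries would be cleared here ... */
	dom->unmapped_since_flush = true;
}

/* 'force' plays the role of the 'simon_says' boolean: flush
 * unconditionally when true, otherwise only if an unmap happened. */
static void demo_commit(struct demo_domain *dom, bool force)
{
	if (force || dom->unmapped_since_flush)
		demo_flush_iotlb(dom);
}
```

Note this sketch sidesteps the cross-CPU concern raised above: if unmap
and commit run on different CPUs, the flag itself becomes a point of
cache contention.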
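The "try your own subrange first" allocator idea could look roughly like
the following toy sketch. The names, the fixed bitmap, and the tiny
sizes are all illustrative assumptions; a real IOVA allocator would use
dynamically sized regions (as suggested above) and per-node rather than
per-CPU subranges on large machines.

```c
#include <stdbool.h>

#define NR_CPUS		4
#define SLOTS_PER_CPU	8
#define TOTAL_SLOTS	(NR_CPUS * SLOTS_PER_CPU)

/* One round-robin cursor per CPU over a shared slot bitmap; a CPU
 * prefers its own subrange and only falls back to the others when
 * its own is exhausted. */
struct demo_iova_space {
	bool used[TOTAL_SLOTS];
	unsigned int next[NR_CPUS];	/* round-robin cursor per CPU */
};

/* Allocate one slot, preferring @cpu's own subrange.
 * Returns the slot index, or -1 if the whole space is full. */
static int demo_alloc(struct demo_iova_space *s, unsigned int cpu)
{
	unsigned int c, i;

	for (c = 0; c < NR_CPUS; c++) {
		unsigned int home = (cpu + c) % NR_CPUS;
		unsigned int base = home * SLOTS_PER_CPU;

		for (i = 0; i < SLOTS_PER_CPU; i++) {
			unsigned int slot =
				base + (s->next[home] + i) % SLOTS_PER_CPU;
			if (!s->used[slot]) {
				s->used[slot] = true;
				s->next[home] =
					(slot - base + 1) % SLOTS_PER_CPU;
				return (int)slot;
			}
		}
	}
	return -1;
}

static void demo_free(struct demo_iova_space *s, int slot)
{
	s->used[slot] = false;
}
```

The point of the layout is that in the common case each CPU touches only
its own cursor and its own part of the bitmap, so allocation does not
bounce cache lines between CPUs; contention only appears on the
fallback path, or when frees happen on a different CPU than the
matching allocation, which is exactly the workload concern quoted above.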