On Mon, 2023-05-22 at 17:26 +0100, Robin Murphy wrote:
> On 2023-05-15 10:15, Niklas Schnelle wrote:
> > In some virtualized environments, including s390 paged memory guests,
> > IOTLB flushes are used to update IOMMU shadow tables. Due to this, they
> > are much more expensive than in typical bare metal environments or
> > non-paged s390 guests. In addition they may parallelize more poorly in
> > virtualized environments. This changes the trade-off for flushing IOVAs
> > such that minimizing the number of IOTLB flushes trumps any benefit of
> > cheaper queuing operations or increased parallelism.
> > 
> > In this scenario per-CPU flush queues pose several problems. Firstly
> > per-CPU memory is often quite limited prohibiting larger queues.
> > Secondly collecting IOVAs per-CPU but flushing via a global timeout
> > reduces the number of IOVAs flushed for each timeout especially on s390
> > where PCI interrupts may not be bound to a specific CPU.
> > 
> > Let's introduce a single flush queue mode that reuses the same queue
> > logic but only allocates a single global queue. This mode can be
> > selected as a flag bit in a new dma_iommu_options struct which can be
> > modified from its defaults by IOMMU drivers implementing a new
> > ops.tune_dma_iommu() callback. As a first user the s390 IOMMU driver
> > selects the single queue mode if IOTLB flushes are needed on map which
> > indicates shadow table use. With the unchanged small FQ size and
> > timeouts this setting is worse than per-CPU queues but a follow-up patch
> > will make the FQ size and timeout variable. Together this allows the
> > common IOVA flushing code to more closely resemble the global flush
> > behavior used on s390's previous internal DMA API implementation.
> > 
> > Link: https://lore.kernel.org/linux-iommu/3e402947-61f9-b7e8-1414-fde006257b6f@xxxxxxx/
> > Reviewed-by: Matthew Rosato <mjrosato@xxxxxxxxxxxxx> #s390
> > Signed-off-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
> > ---
> >  drivers/iommu/dma-iommu.c  | 163 ++++++++++++++++++++++++++++++++++-----------
> >  drivers/iommu/dma-iommu.h  |   4 +-
> >  drivers/iommu/iommu.c      |  18 +++-
> >  drivers/iommu/s390-iommu.c |  10 +++
> >  include/linux/iommu.h      |  21 ++++++
> >  5 files changed, 169 insertions(+), 47 deletions(-)
> > ---8<---
> > 
> > +/**
> > + * struct dma_iommu_options - Options for dma-iommu
> > + *
> > + * @flags: Flag bits for enabling/disabling dma-iommu settings
> > + *
> > + * This structure is intended to provide IOMMU drivers a way to influence the
> > + * behavior of the dma-iommu DMA API implementation. This allows optimizing for
> > + * example for a virtualized environment with slow IOTLB flushes.
> > + */
> > +struct dma_iommu_options {
> > +#define IOMMU_DMA_OPTS_PER_CPU_QUEUE	(0L << 0)
> > +#define IOMMU_DMA_OPTS_SINGLE_QUEUE	(1L << 0)
> > +	u64	flags;
> > +};
> 
> I think for now this can just use a bit in dev_iommu to indicate that
> the device will prefer a global flush queue; s390 can set that in
> .probe_device, then iommu_dma_init_domain() can propagate it to an
> equivalent flag in the cookie (possibly even a new cookie type?) that
> iommu_dma_init_fq() can then consume. Then just make the s390 parameters
> from patch #6 the standard parameters for a global queue.
> 
> Thanks,
> Robin.

Working on this now. How about I move the struct dma_iommu_options
definition into dma-iommu.c, keeping it as part of struct
iommu_dma_cookie? That way we can still keep the flags, timeout and
queue size organized the same way, just internal to dma-iommu.c.
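As a rough, untested sketch of what I mean (field names tentative, the
existing cookie members elided):

/* drivers/iommu/dma-iommu.c, sketch only */
struct dma_iommu_options {
#define IOMMU_DMA_OPTS_PER_CPU_QUEUE	(0L << 0)
#define IOMMU_DMA_OPTS_SINGLE_QUEUE	(1L << 0)
	u64		flags;
	size_t		fq_size;
	unsigned int	fq_timeout;
};

struct iommu_dma_cookie {
	enum iommu_dma_cookie_type	type;
	/* ... existing members unchanged ... */
	/* options tuning the DMA API behavior, private to dma-iommu.c */
	struct dma_iommu_options	options;
};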
We would then set them in iommu_dma_init_domain(), triggered by a
"shadow_on_flush" flag in struct dev_iommu. That way we can keep most
of the same code but only add a single flag as the external interface.
The flag also states an explicit fact about the IOMMU device itself,
namely that IOTLB flushes do extra shadowing work, while the decision
to then use a longer timeout and a larger queue stays with dma-iommu.c.
I think that's overall a better match of responsibilities.
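Again just a rough, untested sketch of the wiring I have in mind; the
s390 part assumes we keep keying off zdev->tlb_refresh as in the
current patch, and the IOVA_SINGLE_FQ_* constants are placeholders for
whatever defaults dma-iommu.c ends up picking:

/* include/linux/iommu.h, sketch */
struct dev_iommu {
	/* ... existing members unchanged ... */
	u32	shadow_on_flush:1; /* IOTLB flushes update shadow tables */
};

/* drivers/iommu/s390-iommu.c, in s390_iommu_probe_device(), sketch */
	if (zdev->tlb_refresh)
		dev->iommu->shadow_on_flush = 1;

/* drivers/iommu/dma-iommu.c, in iommu_dma_init_domain(), sketch */
	if (dev->iommu->shadow_on_flush) {
		cookie->options.flags |= IOMMU_DMA_OPTS_SINGLE_QUEUE;
		/* placeholder defaults for the single queue case */
		cookie->options.fq_timeout = IOVA_SINGLE_FQ_TIMEOUT;
		cookie->options.fq_size = IOVA_SINGLE_FQ_SIZE;
	}

Thanks,
Niklas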