Hi Michael,

On Mon, 15 May 2023 19:43:52 +0000
"Michael Kelley (LINUX)" <mikelley@xxxxxxxxxxxxx> wrote:

> From: Petr Tesarik <petrtesarik@xxxxxxxxxxxxxxx> Sent: Tuesday, May 9, 2023 2:18 AM
> >
> > The software IO TLB was designed with the assumption that it is not used much, especially on 64-bit systems, so a small fixed memory area (currently 64 MiB) is sufficient to handle the few cases which still require a bounce buffer. However, these cases are not so rare in some circumstances.
> >
> > First, if SEV is active, all DMA must be done through shared unencrypted pages, and SWIOTLB is used to make this happen without changing device drivers. The software IO TLB size is increased to 6% of total memory in sev_setup_arch(), but that is more of an approximation. The actual requirements may vary depending on which drivers are used and the amount of I/O.
>
> FWIW, I don't think the approach you have implemented here will be practical to use for CoCo VMs (SEV, TDX, whatever else). The problem is that dma_direct_alloc_pages() and dma_direct_free_pages() must call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo VMs, these calls are expensive because they require a hypercall to the host, and the operation on the host isn't trivial either. I haven't measured the overhead, but doing a hypercall on every DMA map operation and on every unmap operation has long been something we thought we must avoid. The fixed swiotlb bounce buffer space solves this problem by doing set_decrypted() in batch at boot time, and never doing set_encrypted().

I know. Even without CoCo, calling alloc() and free() on each DMA map/unmap is not ideal for performance. I have already tested some code locally that keeps buffers around after unmap, effectively creating a per-device pool like the one you describe below; a rough sketch is included further down, just before my signature. Right now I don't have SEV-capable hardware for testing, but on bare metal this pool brought performance back to the level of the fixed swiotlb buffers, and in some scenarios even improved it. However, these per-device pools add more complexity, so I decided to start with a smaller patch series that can be reviewed more easily. If there is no general objection, I'll send the second part with the per-device pools.

> In Microsoft's first implementation of bounce buffering for SEV-SNP VMs, we created custom bounce buffer code separate from swiotlb. This code did similar what you've done, but maintained a per-device pool of allocated buffers that could be reused, rather than freeing the memory (and marking the memory encrypted again) on every DMA unmap operation. (The pool was actually per-VMBus channel, but VMBus channels are per-device, so the effect was the same.) The reusable pool avoided most of the calls to set_decrypted()/set_encrypted() and made it practical from a performance standpoint. But of course, the pool could grow arbitrarily large, so there was additional complexity to decay and trim the pool size. LKML feedback early on was to use swiotlb instead, which made sense, but at the cost of needing to figure out the appropriate fixed size of the swiotlb, and likely over-provisioning to avoid running out of bounce buffer space.
>
> Now we're considering again a more dynamic approach, which is good, but we're encountering the same problems.
>
> See https://lore.kernel.org/linux-hyperv/20210228150315.2552437-1-ltykernel@xxxxxxxxx/ for this historical example.

Thanks for the pointer. I'll definitely have a look!
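FWIW, here is a rough sketch of the reuse scheme I have been playing with locally, just to illustrate the direction. It is simplified and untested in this form: the dma_io_tlb_free_slots list, the io_tlb_pool_slot structure and the swiotlb_pool_get()/swiotlb_pool_put() helpers are made-up names that do not exist in this series, and the real version also needs logic to trim the pool when it grows too large. The only point is that buffers released by unmap stay around and are handed out again on the next map, so dma_set_decrypted()/dma_set_encrypted() (and the hypercalls behind them in a CoCo guest) are no longer on the per-I/O path.

/* Hypothetical per-device pool of reusable dynamic bounce buffers. */
struct io_tlb_pool_slot {
	struct list_head node;
	size_t alloc_size;		/* page-aligned size of the buffer */
	struct page *page;
	dma_addr_t dma_addr;
};

/*
 * Take a buffer from dev->dma_io_tlb_free_slots (an assumed new list next
 * to dma_io_tlb_dyn_slots) if one is large enough; otherwise allocate a
 * fresh one. Reusing a buffer avoids another encrypted<->shared transition
 * because the pages have stayed decrypted since the previous map.
 */
static struct io_tlb_pool_slot *swiotlb_pool_get(struct device *dev,
						 size_t alloc_size)
{
	struct io_tlb_pool_slot *slot;
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_for_each_entry(slot, &dev->dma_io_tlb_free_slots, node) {
		if (slot->alloc_size >= PAGE_ALIGN(alloc_size)) {
			list_del(&slot->node);
			spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
			return slot;
		}
	}
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);

	/* Nothing suitable in the pool; fall back to a fresh allocation. */
	slot = kmalloc(sizeof(*slot), GFP_NOWAIT | __GFP_NOWARN);
	if (!slot)
		return NULL;
	slot->alloc_size = PAGE_ALIGN(alloc_size);
	slot->page = dma_direct_alloc_pages(dev, slot->alloc_size,
					    &slot->dma_addr, DMA_BIDIRECTIONAL,
					    GFP_NOWAIT | __GFP_NOWARN);
	if (!slot->page) {
		kfree(slot);
		return NULL;
	}
	return slot;
}

/*
 * Instead of calling dma_direct_free_pages() on unmap, park the buffer on
 * the free list so the next map can reuse it without a new allocation.
 */
static void swiotlb_pool_put(struct device *dev, struct io_tlb_pool_slot *slot)
{
	unsigned long flags;

	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
	list_add(&slot->node, &dev->dma_io_tlb_free_slots);
	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
}

With something along these lines, swiotlb_dyn_map()/swiotlb_dyn_unmap() from this patch would call swiotlb_pool_get()/swiotlb_pool_put() instead of allocating and freeing pages directly; the trimming/decay policy is the part that still needs discussion.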
Petr T

> Michael
> >
> > Second, some embedded devices have very little RAM, so 64 MiB is not negligible. Sadly, these are exactly the devices that also often need a software IO TLB. Although minimum swiotlb size can be found empirically by extensive testing, it would be easier to allocate a small swiotlb at boot and let it grow on demand.
> >
> > Growing the SWIOTLB data structures at run time is impossible. The whole SWIOTLB region is contiguous in physical memory to allow combining adjacent slots and also to ensure that alignment constraints can be met. The SWIOTLB is too big for the buddy allocator (cf. MAX_ORDER). More importantly, even if a new SWIOTLB could be allocated (e.g. from CMA), it cannot be extended in-place (because surrounding pages may be already allocated for other purposes), and there is no mechanism for relocating already mapped bounce buffers: The DMA API gets only the address of a buffer, and the implementation (direct or IOMMU) checks whether it belongs to the software IO TLB.
> >
> > It is possible to allocate multiple smaller struct io_tlb_mem instances. However, they would have to be stored in a non-constant container (list or tree), which needs synchronization between readers and writers, creating contention in a hot path for all devices, not only those which need software IO TLB.
> >
> > Another option is to allocate a very large SWIOTLB at boot, but allow migrating pages to other users (like CMA does). This approach might work, but there are many open issues:
> >
> > 1. After a page is migrated away from SWIOTLB, it must not be used as a (direct) DMA buffer. Otherwise SWIOTLB code would have to check which pages have been migrated to determine whether a given buffer address belongs to a bounce buffer or not, effectively introducing all the issues of multiple SWIOTLB instances.
> >
> > 2. Unlike SWIOTLB, CMA cannot be used from atomic contexts, and that for many different reasons. This might be changed in theory, but it would take a lot of investigation and time. OTOH improvement to the SWIOTLB is needed now.
> >
> > 3. If SWIOTLB is implemented separately from CMA and not as its part, users have to solve the dilemma of how to distribute precious DMA-able memory between them.
> >
> > The present patch is a simplistic solution. Bounce buffers are allocated using the non-coherent DMA API, removing the need to implement a new custom allocator. These buffers are then tracked in a per-device linked list, reducing the impact on devices that do not need the SWIOTLB.
> >
> > Analysis of real-world I/O patterns has shown that the same buffer is typically looked up repeatedly (for further sync operations, or to be unmapped). The most recently used bounce buffer is therefore always moved to the beginning of the list. The list performed better than a maple tree when tested with fio against a QEMU SATA drive backed by a RAM block device in the host (4 cores, 16 iodepth). Other scenarios are also likely to benefit from this MRU order but have not been tested.
> >
> > Operations on the list are serialized with a spinlock. It is unfortunately not possible to use an RCU list, because quiescent state is not guaranteed to happen before an asynchronous event (e.g. hardware interrupt) on another CPU.
> > If that CPU used an old version of the list, it would fail to copy data to and/or from the newly allocated bounce buffer.
> >
> > Last but not least, bounce buffers are never allocated dynamically if the SWIOTLB is in fact a DMA restricted pool, because that would defeat the purpose of a restricted pool.
> >
> > Signed-off-by: Petr Tesarik <petr.tesarik.ext@xxxxxxxxxx>
> > ---
> >  include/linux/device.h  |   8 ++
> >  include/linux/swiotlb.h |   8 +-
> >  kernel/dma/swiotlb.c    | 252 ++++++++++++++++++++++++++++++++++++++--
> >  3 files changed, 259 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 472dd24d4823..d1d2b8557b30 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -510,6 +510,12 @@ struct device_physical_location {
> >   * @dma_mem:	Internal for coherent mem override.
> >   * @cma_area:	Contiguous memory area for dma allocations
> >   * @dma_io_tlb_mem: Pointer to the swiotlb pool used.  Not for driver use.
> > + * @dma_io_tlb_dyn_lock:
> > + *	Spinlock to protect the list of dynamically allocated bounce
> > + *	buffers.
> > + * @dma_io_tlb_dyn_slots:
> > + *	Dynamically allocated bounce buffers for this device.
> > + *	Not for driver use.
> >   * @archdata:	For arch-specific additions.
> >   * @of_node:	Associated device tree node.
> >   * @fwnode:	Associated device node supplied by platform firmware.
> > @@ -615,6 +621,8 @@ struct device {
> >  #endif
> >  #ifdef CONFIG_SWIOTLB
> >  	struct io_tlb_mem *dma_io_tlb_mem;
> > +	spinlock_t dma_io_tlb_dyn_lock;
> > +	struct list_head dma_io_tlb_dyn_slots;
> >  #endif
> >  	/* arch specific additions */
> >  	struct dev_archdata	archdata;
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 281ecc6b9bcc..6aada6ac31e2 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -114,6 +114,8 @@ struct io_tlb_mem {
> >  };
> >  extern struct io_tlb_mem io_tlb_default_mem;
> >
> > +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr);
> > +
> >  /**
> >   * is_swiotlb_fixed() - check if a physical address belongs to a swiotlb slot
> >   * @mem:	relevant swiotlb pool
> > @@ -147,7 +149,9 @@ static inline bool is_swiotlb_buffer(struct device *dev,
> > 					phys_addr_t paddr)
> >  {
> >  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >
> > -	return mem && is_swiotlb_fixed(mem, paddr);
> > +	return mem &&
> > +		(is_swiotlb_fixed(mem, paddr) ||
> > +		 is_swiotlb_dyn(dev, paddr));
> >  }
> >
> >  static inline bool is_swiotlb_force_bounce(struct device *dev)
> > @@ -164,6 +168,8 @@ static inline bool is_swiotlb_force_bounce(struct device *dev)
> >  static inline void swiotlb_dev_init(struct device *dev)
> >  {
> >  	dev->dma_io_tlb_mem = &io_tlb_default_mem;
> > +	spin_lock_init(&dev->dma_io_tlb_dyn_lock);
> > +	INIT_LIST_HEAD(&dev->dma_io_tlb_dyn_slots);
> >  }
> >
> >  void swiotlb_init(bool addressing_limited, unsigned int flags);
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 96ba93be6772..612e1c2e9573 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -68,6 +68,22 @@ struct io_tlb_slot {
> >  	unsigned int list;
> >  };
> >
> > +/**
> > + * struct io_tlb_dyn_slot - dynamically allocated swiotlb slot
> > + * @node:	node in the per-device list
> > + * @orig_addr:	physical address of the original DMA buffer
> > + * @alloc_size:	total size of the DMA buffer
> > + * @page:	first page of the bounce buffer
> > + * @dma_addr:	DMA address of the bounce buffer
> > + */
> > +struct io_tlb_dyn_slot {
> > +	struct list_head node;
> > +	phys_addr_t orig_addr;
> > +	size_t alloc_size;
> > +	struct page *page;
> > +	dma_addr_t dma_addr;
> > +};
> > +
> >  static bool swiotlb_force_bounce;
> >  static bool swiotlb_force_disable;
> >
> > @@ -466,6 +482,191 @@ void __init swiotlb_exit(void)
> >  	memset(mem, 0, sizeof(*mem));
> >  }
> >
> > +/**
> > + * lookup_dyn_slot_locked() - look up a dynamically allocated bounce buffer
> > + * @dev:	device which has mapped the buffer
> > + * @paddr:	physical address within the bounce buffer
> > + *
> > + * Walk through the list of bounce buffers dynamically allocated for @dev,
> > + * looking for a buffer which contains @paddr.
> > + *
> > + * Context: Any context. Expects dma_io_tlb_dyn_lock lock to be held.
> > + * Return:
> > + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> > + */
> > +static struct io_tlb_dyn_slot *lookup_dyn_slot_locked(struct device *dev,
> > +						       phys_addr_t paddr)
> > +{
> > +	struct io_tlb_dyn_slot *slot;
> > +
> > +	list_for_each_entry(slot, &dev->dma_io_tlb_dyn_slots, node) {
> > +		phys_addr_t start = page_to_phys(slot->page);
> > +		phys_addr_t end = start + slot->alloc_size - 1;
> > +
> > +		if (start <= paddr && paddr <= end)
> > +			return slot;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/**
> > + * lookup_dyn_slot() - look up a dynamically allocated bounce buffer
> > + * @dev:	device which has mapped the buffer
> > + * @paddr:	physical address within the bounce buffer
> > + *
> > + * Search for a dynamically allocated bounce buffer which contains
> > + * @paddr. If found, the buffer is moved to the beginning of the linked
> > + * list. Use lookup_dyn_slot_locked() directly where this behavior is not
> > + * required/desired.
> > + *
> > + * Context: Any context. Takes and releases dma_io_tlb_dyn_lock.
> > + * Return:
> > + * Address of a &struct io_tlb_dyn_slot, or %NULL if not found.
> > + */
> > +static struct io_tlb_dyn_slot *lookup_dyn_slot(struct device *dev,
> > +					       phys_addr_t paddr)
> > +{
> > +	struct io_tlb_dyn_slot *slot;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > +	slot = lookup_dyn_slot_locked(dev, paddr);
> > +	list_move(&slot->node, &dev->dma_io_tlb_dyn_slots);
> > +	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > +	return slot;
> > +}
> > +
> > +/**
> > + * is_swiotlb_dyn() - check if a physical address belongs to a dynamically
> > + *	allocated bounce buffer
> > + * @dev:	device which has mapped the buffer
> > + * @paddr:	physical address within the bounce buffer
> > + *
> > + * Check whether there is a dynamically allocated bounce buffer which
> > + * contains @paddr. If found, the buffer is moved to the beginning of
> > + * the MRU list.
> > + *
> > + * Return:
> > + * * %true if @paddr points into a dynamically allocated bounce buffer
> > + * * %false otherwise
> > + */
> > +bool is_swiotlb_dyn(struct device *dev, phys_addr_t paddr)
> > +{
> > +	return !!lookup_dyn_slot(dev, paddr);
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_bounce() - copy a dynamically allocated buffer from or back
> > + *	to the original dma location
> > + * @dev:	device which has mapped the buffer
> > + * @tlb_addr:	physical address inside the bounce buffer
> > + * @size:	size of the region to be copied
> > + * @dir:	direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + * This function works only for dynamically allocated bounce buffers.
> > + */
> > +static void swiotlb_dyn_bounce(struct device *dev, phys_addr_t tlb_addr,
> > +			       size_t size, enum dma_data_direction dir)
> > +{
> > +	struct io_tlb_dyn_slot *slot = lookup_dyn_slot(dev, tlb_addr);
> > +	unsigned int tlb_offset;
> > +	unsigned char *vaddr;
> > +
> > +	if (!slot)
> > +		return;
> > +
> > +	tlb_offset = tlb_addr - page_to_phys(slot->page);
> > +	vaddr = page_address(slot->page) + tlb_offset;
> > +
> > +	swiotlb_copy(dev, slot->orig_addr, vaddr, size, slot->alloc_size,
> > +		     tlb_offset, dir);
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_map() - allocate a bounce buffer dynamically
> > + * @dev:	device which maps the buffer
> > + * @orig_addr:	address of the original buffer
> > + * @alloc_size:	total size of the original buffer
> > + * @alloc_align_mask:
> > + *	required physical alignment of the I/O buffer
> > + * @dir:	direction of the data transfer
> > + * @attrs:	optional DMA attributes for the map operation
> > + *
> > + * Allocate a new bounce buffer using DMA non-coherent API. This function
> > + * assumes that there is a fallback allocation scheme if the allocation
> > + * fails. In fact, it always fails for buffers smaller than a page and
> > + * for address constraints that are not (yet) correctly handled by
> > + * dma_direct_alloc_pages().
> > + *
> > + * Return: Physical address of the bounce buffer, or %DMA_MAPPING_ERROR.
> > + */
> > +static phys_addr_t swiotlb_dyn_map(struct device *dev, phys_addr_t orig_addr,
> > +		size_t alloc_size, unsigned int alloc_align_mask,
> > +		enum dma_data_direction dir, unsigned long attrs)
> > +{
> > +	struct io_tlb_dyn_slot *slot;
> > +	unsigned long flags;
> > +	gfp_t gfp;
> > +
> > +	/* Allocation has page granularity. Avoid small buffers. */
> > +	if (alloc_size < PAGE_SIZE)
> > +		goto err;
> > +
> > +	/* DMA direct does not deal with physical address constraints. */
> > +	if (alloc_align_mask || dma_get_min_align_mask(dev))
> > +		goto err;
> > +
> > +	gfp = (attrs & DMA_ATTR_MAY_SLEEP) ? GFP_NOIO : GFP_NOWAIT;
> > +	slot = kmalloc(sizeof(*slot), gfp | __GFP_NOWARN);
> > +	if (!slot)
> > +		goto err;
> > +
> > +	slot->orig_addr = orig_addr;
> > +	slot->alloc_size = alloc_size;
> > +	slot->page = dma_direct_alloc_pages(dev, PAGE_ALIGN(alloc_size),
> > +					    &slot->dma_addr, dir,
> > +					    gfp | __GFP_NOWARN);
> > +	if (!slot->page)
> > +		goto err_free_slot;
> > +
> > +	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > +	list_add(&slot->node, &dev->dma_io_tlb_dyn_slots);
> > +	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > +
> > +	return page_to_phys(slot->page);
> > +
> > +err_free_slot:
> > +	kfree(slot);
> > +err:
> > +	return (phys_addr_t)DMA_MAPPING_ERROR;
> > +}
> > +
> > +/**
> > + * swiotlb_dyn_unmap() - unmap a dynamically allocated bounce buffer
> > + * @dev:	device which mapped the buffer
> > + * @tlb_addr:	physical address of the bounce buffer
> > + * @dir:	direction of the data transfer
> > + *
> > + * Release all resources associated with a dynamically allocated bounce
> > + * buffer.
> > + */
> > +static void swiotlb_dyn_unmap(struct device *dev, phys_addr_t tlb_addr,
> > +			      enum dma_data_direction dir)
> > +{
> > +	struct io_tlb_dyn_slot *slot;
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&dev->dma_io_tlb_dyn_lock, flags);
> > +	slot = lookup_dyn_slot_locked(dev, tlb_addr);
> > +	list_del(&slot->node);
> > +	spin_unlock_irqrestore(&dev->dma_io_tlb_dyn_lock, flags);
> > +
> > +	dma_direct_free_pages(dev, slot->alloc_size, slot->page,
> > +			      slot->dma_addr, dir);
> > +	kfree(slot);
> > +}
> > +
> >  /*
> >   * Return the offset into a iotlb slot required to keep the device happy.
> >   */
> > @@ -474,11 +675,19 @@ static unsigned int swiotlb_align_offset(struct device *dev,
> > u64 addr)
> >  	return addr & dma_get_min_align_mask(dev) & (IO_TLB_SIZE - 1);
> >  }
> >
> > -/*
> > - * Bounce: copy the swiotlb buffer from or back to the original dma location
> > +/**
> > + * swiotlb_fixed_bounce() - copy a fixed-slot swiotlb buffer from or back
> > + *	to the original dma location
> > + * @dev:	device which has mapped the buffer
> > + * @tlb_addr:	physical address inside the bounce buffer
> > + * @size:	size of the region to be copied
> > + * @dir:	direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + * This function works only for fixed bounce buffers.
> >   */
> > -static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> > -			   enum dma_data_direction dir)
> > +static void swiotlb_fixed_bounce(struct device *dev, phys_addr_t tlb_addr,
> > +				 size_t size, enum dma_data_direction dir)
> >  {
> >  	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> >  	int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
> > @@ -574,6 +783,25 @@ static void swiotlb_copy(struct device *dev, phys_addr_t
> > orig_addr,
> >  	}
> >  }
> >
> > +/**
> > + * swiotlb_bounce() - copy the swiotlb buffer from or back to the original
> > + *	dma location
> > + * @dev:	device which has mapped the buffer
> > + * @tlb_addr:	physical address inside the bounce buffer
> > + * @size:	size of the region to be copied
> > + * @dir:	direction of the data transfer
> > + *
> > + * Copy data to or from a buffer of @size bytes starting at @tlb_addr.
> > + */
> > +static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size,
> > +			   enum dma_data_direction dir)
> > +{
> > +	if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> > +		swiotlb_fixed_bounce(dev, tlb_addr, size, dir);
> > +	else
> > +		swiotlb_dyn_bounce(dev, tlb_addr, size, dir);
> > +}
> > +
> >  static inline phys_addr_t slot_addr(phys_addr_t start, phys_addr_t idx)
> >  {
> >  	return start + (idx << IO_TLB_SHIFT);
> > @@ -834,8 +1062,13 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev,
> > phys_addr_t orig_addr,
> >  		return (phys_addr_t)DMA_MAPPING_ERROR;
> >  	}
> >
> > -	tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> > -				     alloc_align_mask, attrs);
> > +	tlb_addr = (phys_addr_t)DMA_MAPPING_ERROR;
> > +	if (!is_swiotlb_for_alloc(dev))
> > +		tlb_addr = swiotlb_dyn_map(dev, orig_addr, alloc_size,
> > +					   alloc_align_mask, dir, attrs);
> > +	if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR)
> > +		tlb_addr = swiotlb_fixed_map(dev, orig_addr, alloc_size,
> > +					     alloc_align_mask, attrs);
> >
> >  	if (tlb_addr == (phys_addr_t)DMA_MAPPING_ERROR) {
> >  		if (!(attrs & DMA_ATTR_NO_WARN))
> > @@ -919,7 +1152,10 @@ void swiotlb_tbl_unmap_single(struct device *dev,
> > phys_addr_t tlb_addr,
> >  	    (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
> >  		swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
> >
> > -	swiotlb_release_slots(dev, tlb_addr);
> > +	if (is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> > +		swiotlb_release_slots(dev, tlb_addr);
> > +	else
> > +		swiotlb_dyn_unmap(dev, tlb_addr, dir);
> >  }
> >
> >  void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
> > @@ -1089,7 +1325,7 @@ bool swiotlb_free(struct device *dev, struct page *page,
> > size_t size)
> >  {
> >  	phys_addr_t tlb_addr = page_to_phys(page);
> >
> > -	if (!is_swiotlb_buffer(dev, tlb_addr))
> > +	if (!is_swiotlb_fixed(dev->dma_io_tlb_mem, tlb_addr))
> >  		return false;
> >
> >  	swiotlb_release_slots(dev, tlb_addr);
> > --
> > 2.25.1
>