On Tue, May 16, 2023 at 08:39:42AM +0200, Petr Tesařík wrote:
> On Tue, 16 May 2023 08:13:09 +0200
> Christoph Hellwig <hch@xxxxxx> wrote:
> > On Mon, May 15, 2023 at 07:43:52PM +0000, Michael Kelley (LINUX) wrote:
> > > FWIW, I don't think the approach you have implemented here will be
> > > practical to use for CoCo VMs (SEV, TDX, whatever else). The problem
> > > is that dma_direct_alloc_pages() and dma_direct_free_pages() must
> > > call dma_set_decrypted() and dma_set_encrypted(), respectively. In CoCo
> > > VMs, these calls are expensive because they require a hypercall to the
> > > host, and the operation on the host isn't trivial either. I haven't
> > > measured the overhead, but doing a hypercall on every DMA map operation
> > > and on every unmap operation has long been something we thought we must
> > > avoid. The fixed swiotlb bounce buffer space solves this problem by
> > > doing set_decrypted() in batch at boot time, and never
> > > doing set_encrypted().
> >
> > I also suspect it doesn't really scale too well due to the number of
> > allocations. I suspect a better way to implement things would be to
> > add more large chunks that are used just like the main swiotlb buffers.
> >
> > That is when we run out of space try to allocate another chunk of the
> > same size in the background, similar to what we do with the pool in
> > dma-pool.c. This means we'll do a fairly large allocation, so we'll
> > need compaction or even CMA to back it up, but the other big upside
> > is that it also reduces the number of buffers that need to be checked
> > in is_swiotlb_buffer or the free / sync side.
>
> I have considered this approach. The two main issues I ran into were:
>
> 1. MAX_ORDER allocations were too small (at least with 4K pages), and
>    even then they would often fail.
>
> 2. Allocating from CMA did work but only from process context.
>    I made a stab at modifying the CMA allocator to work from interrupt
>    context, but there are non-trivial interactions with the buddy
>    allocator. Making them safe from interrupt context looked like a
>    major task.

Can you kick off a worker thread when the swiotlb allocation gets past
some reserve limit? It still has a risk of failing to bounce until the
swiotlb buffer is extended.

> I also had some fears about the length of the dynamic buffer list. I
> observed maximum length for block devices, and then it roughly followed
> the queue depth. Walking a few hundred buffers was still fast enough.
> I admit the list length may become an issue with high-end NVMe and
> I/O-intensive applications.

You could replace the list with an rbtree: O(log n) look-up vs O(n),
which could be faster if you have many bounces active.

--
Catalin
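
A minimal sketch of the worker-thread idea above, assuming a hypothetical
swiotlb_grow_pool() helper and a simple usage watermark (neither exists in
the kernel today; this is only an illustration of the shape of the idea):

	/*
	 * Hypothetical sketch: grow the swiotlb pool from a workqueue when
	 * the fast-path allocation finds usage past a reserve watermark.
	 * swiotlb_grow_pool() and the used/total accounting passed in here
	 * are assumptions, not existing kernel interfaces.
	 */
	#include <linux/workqueue.h>
	#include <linux/atomic.h>

	static void swiotlb_grow_workfn(struct work_struct *work);
	static DECLARE_WORK(swiotlb_grow_work, swiotlb_grow_workfn);
	static atomic_t swiotlb_grow_pending;

	static void swiotlb_grow_workfn(struct work_struct *work)
	{
		/* Process context: large allocations / CMA are allowed here. */
		swiotlb_grow_pool();			/* hypothetical */
		atomic_set(&swiotlb_grow_pending, 0);
	}

	/* Called from the map path after a slot has been allocated. */
	static void swiotlb_maybe_grow(unsigned long used, unsigned long total)
	{
		/* Kick the worker once when less than 1/8 of the pool is left. */
		if (used + total / 8 >= total &&
		    !atomic_xchg(&swiotlb_grow_pending, 1))
			schedule_work(&swiotlb_grow_work);
	}

The map path only kicks the worker; the large allocation (and any CMA or
compaction work) happens in process context, which sidesteps the
interrupt-context limitation of the CMA allocator mentioned above, at the
cost of possibly failing to bounce until the new chunk shows up.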
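
And a sketch of the rbtree variant, assuming a hypothetical struct
swiotlb_dyn_buf describing each dynamically allocated bounce buffer, keyed
by physical address; locking (or RCU) around the tree is left to the caller:

	/*
	 * Hypothetical sketch: index dynamic bounce buffers in an rbtree so
	 * the is_swiotlb_buffer() check becomes an O(log n) walk instead of
	 * an O(n) list scan.  struct swiotlb_dyn_buf and the tree root are
	 * assumptions for illustration only.
	 */
	#include <linux/rbtree.h>
	#include <linux/types.h>

	struct swiotlb_dyn_buf {
		struct rb_node node;
		phys_addr_t start;
		size_t size;
	};

	static struct rb_root swiotlb_dyn_root = RB_ROOT;

	static struct swiotlb_dyn_buf *swiotlb_dyn_find(phys_addr_t paddr)
	{
		struct rb_node *n = swiotlb_dyn_root.rb_node;

		while (n) {
			struct swiotlb_dyn_buf *buf =
				rb_entry(n, struct swiotlb_dyn_buf, node);

			if (paddr < buf->start)
				n = n->rb_left;
			else if (paddr >= buf->start + buf->size)
				n = n->rb_right;
			else
				return buf;	/* paddr inside this buffer */
		}
		return NULL;
	}

	static void swiotlb_dyn_insert(struct swiotlb_dyn_buf *new)
	{
		struct rb_node **link = &swiotlb_dyn_root.rb_node, *parent = NULL;

		while (*link) {
			struct swiotlb_dyn_buf *buf =
				rb_entry(*link, struct swiotlb_dyn_buf, node);

			parent = *link;
			if (new->start < buf->start)
				link = &(*link)->rb_left;
			else
				link = &(*link)->rb_right;
		}
		rb_link_node(&new->node, parent, link);
		rb_insert_color(&new->node, &swiotlb_dyn_root);
	}

Look-up cost then scales with the log of the number of active bounce
buffers, which should mostly matter in the high queue-depth NVMe case
mentioned above.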