On Thu, Apr 21, 2022 at 12:20:22AM -0700, Christoph Hellwig wrote:
> Btw, there is another option: Most real systems already require having
> swiotlb to bounce buffer in some cases. We could simply force bounce
> buffering in the dma mapping code for too small or not properly aligned
> transfers and just decrease the dma alignment.

We can force bounce if size is small but checking the alignment is
trickier. Normally the beginning of the buffer is aligned but the end is
at some sizeof() distance. We need to know whether the end is in a
kmalloc-128 cache and that requires reaching out to the slab internals.
That's doable and not expensive but it needs to be done for every small
size getting to the DMA API, something like (for mm/slub.c):

	folio = virt_to_folio(x);
	slab = folio_slab(folio);
	if (slab->slab_cache->align < ARCH_DMA_MINALIGN)
		... bounce ...

(and a bit different for mm/slab.c)

If we scrap ARCH_DMA_MINALIGN altogether from arm64, we can check the
alignment against cache_line_size(), though I'd rather keep it for code
that wants to avoid bouncing and goes for this compile-time alignment.

I think we are down to four options (1 and 2 can be combined):

1. ARCH_DMA_MINALIGN == 128, dynamic arch_kmalloc_minalign() to reduce
   kmalloc() alignment to 64 on most arm64 SoC - this series.

2. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 128, add explicit
   __GFP_PACKED for small allocations. It can be combined with (1) so
   that allocations without __GFP_PACKED can still get 64-byte
   alignment.

3. ARCH_DMA_MINALIGN == 128, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.

4. undef ARCH_DMA_MINALIGN, ARCH_KMALLOC_MINALIGN == 8, swiotlb bounce.

(3) and (4) don't require histogram analysis. Between them, I have a
preference for (3) as it gives drivers a chance to avoid the bounce.

If (2) is feasible, we don't need to bother with any bouncing or
structure alignments, it's an opt-in by the driver/subsystem. However,
it may be tedious to analyse the hot spots.
While there are a few obvious places (kstrdup), I don't have access to a
multitude of devices that may exercise the drivers and subsystems.

With (3) the risk is someone complaining about performance or even
running out of swiotlb space on some SoCs (I guess the fall-back can be
another kmalloc() with an appropriate size).

I guess we can limit the choice to either (2) or (3). I have (2) already
(needs some more testing). I can attempt (3) and try to run it on some
real hardware to see the perf impact.

--
Catalin
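
For illustration, the bounce decision behind option (3) can be sketched in
plain userspace C. This is only a rough sketch under stated assumptions, not
kernel code: struct sketch_cache, sketch_is_aligned(), sketch_needs_bounce()
and SKETCH_DMA_MINALIGN are hypothetical names standing in for the slab cache,
the check in the DMA mapping path and ARCH_DMA_MINALIGN == 128. The first test
mirrors the slab->slab_cache->align comparison above. The second test (no
bounce when the buffer both starts and ends on a 128-byte boundary) is an
extra assumption of this sketch and is not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value for illustration, matching ARCH_DMA_MINALIGN == 128. */
#define SKETCH_DMA_MINALIGN 128u

/* Hypothetical stand-in for the slab cache a buffer was allocated from. */
struct sketch_cache {
	size_t object_size;	/* e.g. 64 for a kmalloc-64 style cache */
	size_t align;		/* alignment guaranteed by the cache */
};

/* True if v is a multiple of a; a must be a non-zero power of two. */
static bool sketch_is_aligned(uintptr_t v, size_t a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (v & (a - 1)) == 0;
}

/*
 * Decide whether a transfer of 'size' bytes starting at 'addr' should go
 * through a bounce buffer.
 */
static bool sketch_needs_bounce(const struct sketch_cache *c,
				uintptr_t addr, size_t size)
{
	/* The cache already guarantees DMA-safe alignment. */
	if (c->align >= SKETCH_DMA_MINALIGN)
		return false;

	/*
	 * Assumption of this sketch: a buffer that starts and ends on a
	 * DMA-alignment boundary shares no such block with another object.
	 */
	if (sketch_is_aligned(addr, SKETCH_DMA_MINALIGN) &&
	    sketch_is_aligned(addr + size, SKETCH_DMA_MINALIGN))
		return false;

	return true;
}
```

With these assumptions, a 64-byte object from an 8-byte-aligned cache at
0x1000 ends at 0x1040, which is not a multiple of 128, so it would be
bounced. Any buffer from a cache with align == 128 would not.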