On Wed, Apr 13, 2022 at 09:53:24AM -1000, Linus Torvalds wrote:
> On Tue, Apr 12, 2022 at 10:47 PM Catalin Marinas
> <catalin.marinas@xxxxxxx> wrote:
> > I agree. There is also an implicit expectation that the DMA API works on
> > kmalloc'ed buffers and that's what ARCH_DMA_MINALIGN is for (and the
> > dynamic arch_kmalloc_minalign() in this series). But the key point is
> > that the driver doesn't need to know the CPU cache topology, coherency,
> > the DMA API and kmalloc() take care of these.
>
> Honestly, I think it would probably be worth discussing the "kmalloc
> DMA alignment" issues.
>
> 99.9% of kmalloc users don't want to do DMA.
>
> And there's actually a fair amount of small kmalloc for random stuff.
> Right now on my laptop, I have
>
>     kmalloc-8          16907  18432      8  512    1 : ...
>
> according to slabinfo, so almost 17 _thousand_ allocations of 8 bytes.
>
> It's all kinds of sad if those allocations need to be 64 bytes in size
> just because of some silly DMA alignment issue, when none of them want
> it.

It's a lot worse, ARCH_KMALLOC_MINALIGN is currently 128 bytes on arm64.
I want to at least get it down to 64 with this series while preserving
the current kmalloc() semantics.

If we know the SoC is fully coherent (a bit tricky with late probed
devices), we could get the alignment down to 8. In the mobile space,
unfortunately, most DMA is non-coherent.

I think it's worth investigating the __dma annotations that Greg
suggested, though I have a suspicion it either is too difficult to track
or we just end up with this annotation everywhere. There are cases where
the memory is allocated outside the driver that knows the DMA needs,
though I guess these are either full page allocations or
kmem_cache_alloc() (e.g. page cache pages, skb).

There's also Ard's suggestion to bounce the (inbound DMA) buffer if not
aligned. That's doable but dma_map_single(), for example, only gets the
size of some random structure/buffer. If the size is below
ARCH_DMA_MINALIGN (or cache_line_size()), the DMA API implementation
would have to retrieve the slab cache, check the real allocation size
and then bounce if necessary.

Irrespective of which option we go for, I think at least part of this
series decoupling ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN is still
needed since currently the minalign is used in some compile time
attributes. Even getting the kmalloc() size down to 64 is a significant
improvement over 128.

Subsequently I'd attempt Ard's bouncing idea as a quick workaround and
assess the bouncing overhead on some real platforms. This may be needed
before we track down all places to use dma_kmalloc().

I need to think some more on Greg's __dma annotation, as I said the
allocation may be decoupled from the driver in some cases.

--
Catalin
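
For illustration only, the bounce check described above (size below the DMA minimum alignment, then look up the real allocation size, then bounce if necessary) could be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not kernel code: slab_object_size() and dma_needs_bounce() are hypothetical names, the size classes are simplified to powers of two starting at 8 bytes (the real kmalloc caches also include sizes such as 96 and 192), and the minimum alignment is passed in as a plain parameter rather than taken from ARCH_DMA_MINALIGN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define KMALLOC_MIN_SIZE 8u	/* assumed smallest size class */

/*
 * Hypothetical stand-in for "retrieve the slab cache and check the real
 * allocation size": round the request up to a power-of-two size class.
 */
static size_t slab_object_size(size_t size)
{
	size_t obj = KMALLOC_MIN_SIZE;

	assert(size > 0);
	while (obj < size)
		obj <<= 1;
	return obj;
}

/*
 * Hypothetical check for an inbound (device-to-memory) mapping: only
 * requests smaller than dma_minalign need the lookup, and only those
 * whose backing object is also smaller than dma_minalign can share a
 * cache line with a neighbouring object, so only those are bounced.
 */
static bool dma_needs_bounce(size_t size, size_t dma_minalign)
{
	if (size >= dma_minalign)
		return false;
	return slab_object_size(size) < dma_minalign;
}
```

With a 128-byte minimum, an 8-byte or 64-byte request would be bounced in this model, while a 100-byte request would not, since it is backed by a 128-byte object. The lookup on every small mapping is the overhead that would need measuring on real platforms.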