On Fri, Oct 14, 2022 at 01:44:25PM -0700, Linus Torvalds wrote:
> On Fri, Oct 14, 2022 at 1:24 PM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
> > Agreed. Even allowing a 64-byte kmalloc cache on a system with a
> > 64-byte cacheline size saves quite a bit of memory.
>
> Well, the *really* trivial thing to do is to just say "if the platform
> is DMA coherent, just allow any size kmalloc cache". And just
> consciously leave the broken garbage behind.

The problem is we don't have a reliable way to tell whether the platform
is DMA-coherent. The CPU IDs don't really say much and in most cases
it's a property of the interconnect/bus and device. We describe the DMA
coherency in DT or ACPI and the latter is somewhat better as it assumes
coherent by default. But for DT, not having a 'dma-coherent' property
means non-coherent DMA (or no DMA at all). We can't even tell whether
the device is going to do any DMA: arch_setup_dma_ops() is called even
for devices like the PMU. We could look into defining new properties
(e.g. "no-dma") and adjusting the DTs, but we may also have late-probed
devices, long after the slab allocator was initialised. A big
'dma-coherent' property on the top node may work but most Android
systems won't benefit from this (your laptop may, I haven't checked).

I think the best bet is still either (a) bounce for small sizes or (b) a
new GFP_NODMA/PACKED/etc. flag for the hot small allocations. (a) is
somewhat more universal but lots of (most) Android devices are deployed
with no swiotlb buffer, as the vendor knows the device's needs and
doesn't need the extra buffer waste. Not sure how reliable it would be
to trigger another slab allocation on the dma_map_*() calls for the
bounce (it may need to be GFP_ATOMIC). Option (b) looks more appealing
on such systems, though it means a lot more churn throughout the kernel.

-- 
Catalin
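
For illustration only, the check behind option (a) might look roughly
like the sketch below. It is plain userspace C, not kernel code: struct
dev_info, needs_bounce() and DMA_MIN_ALIGN are hypothetical names, the
128-byte value is an assumed cache writeback granule, and the sketch
assumes that kmalloc buffers of at least DMA_MIN_ALIGN bytes are
themselves aligned to it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed cache writeback granule for non-coherent DMA (hypothetical value). */
#define DMA_MIN_ALIGN 128u

/* Hypothetical stand-in for the per-device coherency described in DT/ACPI. */
struct dev_info {
	bool dma_coherent;
};

/*
 * Option (a): bounce only when a non-coherent device maps a buffer
 * small enough that it may share a cache line with unrelated data.
 * Coherent devices never need the bounce.
 */
bool needs_bounce(const struct dev_info *dev, size_t size)
{
	assert(dev != NULL);

	if (dev->dma_coherent)
		return false;

	return size < DMA_MIN_ALIGN;
}
```

With these assumptions, a 64-byte buffer mapped for a non-coherent
device would be bounced, while a 128-byte buffer, or any buffer mapped
for a coherent device, would not.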