On Thu, 14 Apr 2022 at 17:01, Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> On Thu, 14 Apr 2022 at 16:53, Greg Kroah-Hartman
> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Thu, Apr 14, 2022 at 04:36:46PM +0200, Ard Biesheuvel wrote:
> > > On Thu, 14 Apr 2022 at 16:27, Greg Kroah-Hartman
> > > <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Apr 14, 2022 at 03:52:53PM +0200, Ard Biesheuvel wrote:
> ...
> > > > > What we might do, given the fact that only inbound non-cache coherent
> > > > > DMA is problematic, is dropping the kmalloc alignment to 8 like on
> > > > > x86, and falling back to bounce buffering when a misaligned, non-cache
> > > > > coherent inbound DMA mapping is created, using the SWIOTLB bounce
> > > > > buffering code that we already have, and is already in use on most
> > > > > affected systems for other reasons (i.e., DMA addressing limits)
> > > >
> > > > Ick, that's a mess.
> > > >
> > > > > This will cause some performance regressions, but in a way that seems
> > > > > fixable to me: taking network drivers as an example, the RX buffers
> > > > > that are filled using inbound DMA are typically owned by the driver
> > > > > itself, which could be updated to round up its allocations and DMA
> > > > > mappings. Block devices typically operate on quantities that are
> > > > > aligned sufficiently already. In other cases, we will likely notice
> > > > > if/when this fallback is taken on a hot path, but if we don't, at
> > > > > least we know a bounce buffer is being used whenever we cannot perform
> > > > > the DMA safely in-place.
> > > >
> > > > We can move to having an "allocator-per-bus" for memory like this to
> > > > allow the bus to know if this is a DMA requirement or not.
> > > >
> > > > So for all USB drivers, we would have:
> > > >         usb_kmalloc(size, flags);
> > > > and then it might even be easier to verify with static tools that the
> > > > USB drivers are sending only properly allocated data.
> > > > Same for SPI and
> > > > other busses.
> > >
> > > As I pointed out earlier in the thread, alignment/padding requirements
> > > for non-coherent DMA are a property of the CPU's cache hierarchy, not
> > > of the device. So I'm not sure I follow how a per-subsystem
> > > distinction would help here. In the case of USB especially, would that
> > > mean that block, media and networking subsystems would need to be
> > > aware of the USB-ness of the underlying transport?
> >
> > That's what we have required today, yes. That's only because we knew
> > that for some USB controllers, that was a requirement and we had no way
> > of passing that information back up the stack so we just made it a
> > requirement.
> >
> > But I do agree this is messy. It's even messier for things like USB
> > where it's not the USB device itself that matters, it's the USB
> > controller that the USB device is attached to. And that can be _way_ up
> > the device hierarchy. Attach something like a NFS mount over a PPP
> > network connection on a USB to serial device and ugh, where do you
> > begin? :)
> >
>
> Exactly.
>
> > And is this always just an issue of the CPU cache hierarchy? And not the
> > specific bridge that a device is connected to that CPU on? Or am I
> > saying the same thing here?
> >
>
> Yes, this is a system property not a device property, and the driver
> typically doesn't have any knowledge of this. For example, if a PCI
> host bridge happens to be integrated in a non-cache coherent way, any
> PCI device plugged into it becomes non-coherent, and the associated
> driver needs to do the right thing. This is why we rely on the DMA
> layer to take care of this.
>
> > I mean take a USB controller for example. We could have a system where
> > one USB controller is on a PCI bus, while another is on a "platform"
> > bus. Both of those are connected to the CPU in different ways and so
> > could have different DMA rules.
> > Do we downgrade everything in the
> > system for the worst connection possible?
> >
>
> No, we currently support a mix of coherent and non-coherent just fine,
> and this shouldn't change. It's just that the mere fact that
> non-coherent devices might exist is increasing the memory footprint of
> all kmalloc allocations.
>
> > Again, consider a USB driver allocating memory to transfer stuff, should
> > it somehow know the cache hierarchy that it is connected to? Right now
> > we punt and do not do that at the expense of a bit of potentially
> > wasted memory for small allocations.
> >
>
> This whole discussion is based on the premise that this is an expense
> we would prefer to avoid. Currently, every kmalloc allocation is
> rounded up to 128 bytes on arm64, while x86 uses only 8.

I guess I didn't answer that last question. Yes, dma_kmalloc() should be
used in such cases. Combined with my bounce buffering hack, the penalty
for using plain kmalloc() instead would be a potential performance hit
on inbound DMA rather than data corruption (assuming we reduce the
kmalloc() alignment when introducing dma_kmalloc()).