On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote: > On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote: >> I had another look at the arm64 side, which (like the zicbom >> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE), >> as that has changed not that long ago, see >> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572 > > which IIRC has been reverted recently. To clarify: I was looking at arch_sync_dma_for_device(), which changed from 'invalidate' to 'clean' last June in commit c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers at start of DMA transfer"). I don't see a revert for that. The one that was reverted recently is arch_dma_prep_coherent, which was changed and reverted in c44094eee32 Aug 23 2022 flush->clean b7d9aae4048 Dec 6 2022 clean->flush I'm primarily interested in the streaming mappings (arch_sync_*) at the moment. >> I'm still not sure what the correct set of operations has >> to be, but nothing in that patch description sounds ISA >> or even microarchitecture specific. > > Nothing is ISA specific, and the only micro architecture related thing > is if the specific core can speculate memory accesses. See the table > in arch/arc/mm/dma.c for details. > > And as commented on the arm64 patch I really hate architectures getting > creative here, as I'd much prefer to move the choice of primitives to > the core DMA code and just provide helpers to invalidate/writeback/ > writeback+invalidate from the architectures eventually. Agreed, that's why I particularly don't like the idea of giving SoC specific L2$ drivers the option of making custom choices here. The arch_dma_prep_coherent() change is arguably arm64 specific because it is only needed for machines sharing memory with EL3 firmware that enforces buffer ownership, but even for that it would be nice to have a central definition and some architecture specific flag that picks one behavior or the other. Same thing for the speculative access difference. I looked at all the implementations now and put them in a table[1] to see what the differences are. The only bit that I think needs discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op that I mentioned above. I see that arm64, csky, powerpc, riscv and parisc all write out at least partical cache lines first to avoid losing dirty data in the part that is not written by the device[2][3], while the other ones don't[4]. The other differences I found are: - arm and arm64 use clean instead of flush for dma_sync_single_for_device(DMA_BIDIRECTIONAL). This seems harmless, since there is another invalidation at the _for_cpu() step. - hexagon, m68k, openrisc and sh don't deal with speculative loads - m68k, nios2, openrisc, parisc and xtensa use flush instead of clean for DMA_TO_DEVICE, presumably because they don't have a flush without invalidate. - microblaze, parisc and powerpc use the same function for _for_device and _for_cpu, which is slightly wasteful but harmless. - openrisc lacks support for bidirectional mappings, which is a bug - sparc32 and xtensa only supports writethrough caches and consequently skips the clean before DMA but instead invalidate the buffer in the _for_cpu operation. I also see that at least arc, arm, mips and riscv all want CPU specific cache management operations to be registered at boot time. While Russell had some concerns about your suggestion to generalize the arm version, we could start by moving the abstracted riscv version into kernel/dma/direct.c and make sure it can be shared with at least mips and arc, provided that we can agree on the DMA_FROM_DEVICE behavior. Arnd [1] https://docs.google.com/spreadsheets/d/1qDuMqB6TnRTj_CgUwgIIm_RJ6EZO76qohpTJUMQjUEo/edit#gid=0 [2] https://git.kernel.org/torvalds/c/c50f11c6196f [3] https://git.kernel.org/torvalds/c/03d70617b8a7 [4] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/