[CC'ing a bunch of arch maintainers who are not "obviously" immune to this
 issue; please can you check your archs?]

Hi folks,

Some off-list discussions about the arm64 implementation of
arch_sync_dma_for_device() have left me unsure about the following
architectures in this area:

  arc
  arm
  arm64
  hexagon
  m68k
  microblaze
  nios2
  openrisc
  powerpc
  sh

The background is that, when using the streaming DMA API to map a buffer
prior to inbound non-coherent DMA (i.e. DMA_FROM_DEVICE), it is often
necessary to remove any dirty CPU cache lines so that they can't get
evicted during the transfer and corrupt the buffer contents written by
the DMA. The architectures listed above appear to achieve this using
*invalidation*, which tends to imply discarding of the CPU cache lines
rather than writing them back, and this raises two related questions:

  (1) What if the DMA transfer doesn't write to every byte in the buffer?

  (2) What if the buffer has a virtual alias in userspace (e.g. because
      the kernel has GUP'd the buffer)?

In both cases, stale data (possibly containing random values written
prior to the initial clearing of the page) can be read from the buffer.
In case (1), this applies to the unwritten bytes after the transfer has
completed, and in case (2) it applies to the user alias of the whole
buffer during the narrow window between arch_sync_dma_for_device() and
the DMA writes landing in memory.

The simplest fix (diff for arm64 below) seems to be changing the
invalidation in this case to a "clean" in arm(64)-speak, so that any
dirty lines are written back, thereby limiting the stale data to the
initial buffer contents. In doing so, this makes the FROM_DEVICE and
BIDIRECTIONAL cases identical, which makes some intuitive sense if you
think of FROM_DEVICE as first doing a TO_DEVICE of any dirty CPU cache
lines.

One interesting thing I noticed is that the csky implementation
additionally zeroes the buffer prior to the clean, but this seems to be
overkill. An alternative solution would be to ensure that newly allocated
pages are clean rather than just zeroed; although this would then handle
any variants of case (2) (e.g. where userspace can access a buffer with
non-cacheable attributes), it's likely to have an intolerable performance
cost.

Finally, on arm(64), the DMA mapping code tries to deal with buffers that
are not cacheline aligned by issuing clean-and-invalidate operations for
the overlapping portions at each end of the buffer. I don't think this
makes a tonne of sense, as inevitably one of the writers (either the CPU
or the DMA) is going to win and somebody is going to run into silent data
loss. Having the caller receive DMA_MAPPING_ERROR in this case would
probably be better.

Thoughts? I haven't tried to reproduce the problem above in practice and
this code has been like this since the dawn of time, so we could also
elect to leave it as-is...

Will

--->8

diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 0ea6cc25dc66..21c907987080 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -218,8 +218,6 @@ SYM_FUNC_ALIAS(__dma_flush_area, __pi___dma_flush_area)
  */
 SYM_FUNC_START(__pi___dma_map_area)
 	add	x1, x0, x1
-	cmp	w2, #DMA_FROM_DEVICE
-	b.eq	__pi_dcache_inval_poc
 	b	__pi_dcache_clean_poc
 SYM_FUNC_END(__pi___dma_map_area)
 SYM_FUNC_ALIAS(__dma_map_area, __pi___dma_map_area)
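
[Illustration only, not part of the diff above: a rough, untested C-level
 sketch of the DMA_MAPPING_ERROR idea, loosely modelled on the dma-direct
 mapping path. The function name is made up; the helpers it calls are the
 existing kernel interfaces.]

/*
 * Hypothetical sketch: fail the mapping for buffers that share cache
 * lines with other data, instead of issuing clean+invalidate for the
 * overlapping lines at each end of the buffer.
 */
#include <linux/dma-direct.h>
#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>

static dma_addr_t noncoherent_map_page(struct device *dev, struct page *page,
				       unsigned long offset, size_t size,
				       enum dma_data_direction dir)
{
	phys_addr_t paddr = page_to_phys(page) + offset;

	/* Refuse buffers that don't start and end on a cacheline boundary. */
	if (!IS_ALIGNED(paddr | size, dma_get_cache_alignment()))
		return DMA_MAPPING_ERROR;

	/*
	 * With the cache.S change above, this cleans (rather than
	 * invalidates) dirty lines to the PoC for DMA_FROM_DEVICE, so
	 * nothing dirty can later be evicted on top of the device's writes.
	 */
	arch_sync_dma_for_device(paddr, size, dir);

	return phys_to_dma(dev, paddr);
}

With something along these lines, an unaligned buffer would show up as a
dma_mapping_error() failure at map time rather than as silent corruption
later on.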