[CC'ing a bunch of arch maintainers who are not "obviously" immune to this
 issue; please can you check your archs?]

Hi folks,

Some off-list discussions about the arm64 implementation of
arch_sync_dma_for_device() have left me unsure about the following
architectures in this area:

  arc
  arm
  arm64
  hexagon
  m68k
  microblaze
  nios2
  openrisc
  powerpc
  sh

The background is that, when using the streaming DMA API to map a buffer
prior to inbound non-coherent DMA (i.e. DMA_FROM_DEVICE), it is often
necessary to remove any dirty CPU cache lines so that they can't get
evicted during the transfer and corrupt the buffer contents written by
the DMA. The architectures listed above appear to achieve this using
*invalidation*, which tends to imply discarding of the CPU cache lines
rather than writing them back, and this raises two related questions:

  (1) What if the DMA transfer doesn't write to every byte in the buffer?

  (2) What if the buffer has a virtual alias in userspace (e.g. because
      the kernel has GUP'd the buffer)?

In both cases, stale data (possibly containing random values written
prior to the initial clearing of the page) can be read from the buffer.
In case (1), this applies to the unwritten bytes after the transfer has
completed, and in case (2) it applies to the user alias of the whole
buffer during the narrow window between arch_sync_dma_for_device() and
the DMA writes landing in memory.

The simplest fix (diff for arm64 below) seems to be changing the
invalidation in this case to a "clean" in arm(64)-speak, so that any
dirty lines are written back, thereby limiting the stale data to the
initial buffer contents. In doing so, this makes the FROM_DEVICE and
BIDIRECTIONAL cases identical, which makes some intuitive sense if you
think of FROM_DEVICE as first doing a TO_DEVICE of any dirty CPU cache
lines.

One interesting thing I noticed is that the csky implementation
additionally zeroes the buffer prior to the clean, but this seems to be
overkill. An alternative solution would be to ensure that newly allocated
pages are clean rather than just zeroed; although this would then handle
any variants of case (2) (e.g. where userspace can access a buffer with
non-cacheable attributes), it's likely to have an intolerable performance
cost.

Finally, on arm(64), the DMA mapping code tries to deal with buffers that
are not cacheline aligned by issuing clean-and-invalidate operations for
the overlapping portions at each end of the buffer. I don't think this
makes a tonne of sense, as inevitably one of the writers (either the CPU
or the DMA) is going to win and somebody is going to run into silent data
loss. Having the caller receive DMA_MAPPING_ERROR in this case would
probably be better.

Thoughts? I haven't tried to reproduce the problem above in practice and
this code has been like this since the dawn of time, so we could also
elect to leave it as-is...

Will

--->8

diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S
index 0ea6cc25dc66..21c907987080 100644
--- a/arch/arm64/mm/cache.S
+++ b/arch/arm64/mm/cache.S
@@ -218,8 +218,6 @@ SYM_FUNC_ALIAS(__dma_flush_area, __pi___dma_flush_area)
  */
 SYM_FUNC_START(__pi___dma_map_area)
 	add	x1, x0, x1
-	cmp	w2, #DMA_FROM_DEVICE
-	b.eq	__pi_dcache_inval_poc
 	b	__pi_dcache_clean_poc
 SYM_FUNC_END(__pi___dma_map_area)
 SYM_FUNC_ALIAS(__dma_map_area, __pi___dma_map_area)
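
[Illustration only, not part of the diff above: a rough, untested C-level
 sketch of the DMA_MAPPING_ERROR idea, loosely modelled on the dma-direct
 mapping path. The function name is made up; the helpers it calls are the
 existing kernel interfaces.]

/*
 * Hypothetical sketch: fail the mapping for buffers that share cache
 * lines with other data, instead of issuing clean+invalidate for the
 * overlapping lines at each end of the buffer.
 */
#include <linux/dma-direct.h>
#include <linux/dma-map-ops.h>
#include <linux/dma-mapping.h>

static dma_addr_t noncoherent_map_page(struct device *dev, struct page *page,
				       unsigned long offset, size_t size,
				       enum dma_data_direction dir)
{
	phys_addr_t paddr = page_to_phys(page) + offset;

	/* Refuse buffers that don't start and end on a cacheline boundary. */
	if (!IS_ALIGNED(paddr | size, dma_get_cache_alignment()))
		return DMA_MAPPING_ERROR;

	/*
	 * With the cache.S change above, this cleans (rather than
	 * invalidates) dirty lines to the PoC for DMA_FROM_DEVICE, so
	 * nothing dirty can later be evicted on top of the device's writes.
	 */
	arch_sync_dma_for_device(paddr, size, dir);

	return phys_to_dma(dev, paddr);
}

With something along these lines, an unaligned buffer would show up as a
dma_mapping_error() failure at map time rather than as silent corruption
later on.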