On 2022-03-25 10:25, Maxime Bizon wrote:
On Thu, 2022-03-24 at 12:26 -0700, Linus Torvalds wrote:
It's actually very natural in that situation to flush the caches from
the CPU side again. And so dma_sync_single_for_device() is a fairly
reasonable thing to do in that situation.
In the non-cache-coherent scenario, and assuming dma_map() did an
initial cache invalidation, you can write this:
rx_buffer_complete_1(buf)
{
invalidate_cache(buf, size)
if (!is_ready(buf))
return;
<proceed with receive>
}
or
rx_buffer_complete_2(buf)
{
if (!is_ready(buf)) {
invalidate_cache(buf, size)
return;
}
<proceed with receive>
}
The latter is preferred for performance because dma_map() did the
initial invalidate.
Of course you could write:
rx_buffer_complete_3(buf)
{
invalidate_cache(buf, size)
if
(!is_ready(buf)) {
invalidate_cache(buf, size)
return;
}
<proceed with receive>
}
but it's a waste of CPU cycles
So I'd be very cautious assuming sync_for_cpu() and sync_for_device()
are both doing invalidation in existing implementation of arch DMA ops,
implementers may have taken some liberty around DMA-API to avoid
unnecessary cache operation (not to blame them).
Right, if you have speculatively-prefetching caches, you have to
invalidate DMA_FROM_DEVICE in unmap/sync_for_cpu, since a cache may have
pulled in a snapshot of partly-written data at any point beforehand. But
if you don't, then you can simply invalidate up-front in
map/sync_for_device to tie in with the other directions, and trust that
it stays that way for the duration.
What muddies the waters a bit is that the opposite combination
sync_for_cpu(DMA_TO_DEVICE) really *should* always be a no-op, and I for
one have already made the case for eliding that in code elsewhere, but
it doesn't necessarily hold for the inverse here, hence why I'm not sure
there even is a robust common solution for peeking at a live
DMA_FROM_DEVICE buffer.
Robin.
For example looking at arch/arm/mm/dma-mapping.c, for DMA_FROM_DEVICE
sync_single_for_device()
=> __dma_page_cpu_to_dev()
=> dma_cache_maint_page(op=dmac_map_area)
=> cpu_cache.dma_map_area()
sync_single_for_cpu()
=> __dma_page_dev_to_cpu()
=>
__dma_page_cpu_to_dev(op=dmac_unmap_area)
=>
cpu_cache.dma_unmap_area()
dma_map_area() always does cache invalidate.
But for a couple of CPU variant, dma_unmap_area() is a noop, so
sync_for_cpu() does nothing.
Toke's patch will break ath9k on those platforms (mostly silent
breakage, rx corruption leading to bad performance)
There's a fair number of those dma_sync_single_for_device() things
all over. Could we find mis-uses and warn about them some way? It
seems to be a very natural thing to do in this context, but bounce
buffering does make them very fragile.
At least in network drivers, there are at least two patterns:
1) The issue at hand, hardware mixing rx_status and data inside the
same area. Usually very old hardware, very quick grep in network
drivers only revealed slicoss.c. Probably would have gone unnoticed if
ath9k hardware wasn't so common.
2) The very common "copy break" pattern. If a received packet is
smaller than a certain threshold, the driver rx path is changed to do:
sync_for_cpu()
alloc_small_skb()
memcpy(small_skb, rx_buffer_data)
sync_for_device()
Original skb is left in the hardware, this reduces memory wasted.
This pattern is completely valid wrt DMA-API, the buffer is always
either owned by CPU or device.