dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20] dma-mapping: provide a generic dma-noncoherent implementation)

linux@xxxxxxxxxxxxxxx (Russell King - ARM Linux) · Fri, 18 May 2018 22:33:10 +0100

On Fri, May 18, 2018 at 07:57:34PM +0000, Alexey Brodkin wrote:
> Hi Russel,

That's Russell.

> On Fri, 2018-05-18 at 18:50 +0100, Russell King - ARM Linux wrote:
> > It's necessary.  Take a moment to think carefully about this:
> > 
> >         dma_map_single(, dir)
> > 
> >         dma_sync_single_for_cpu(, dir)
> > 
> >         dma_sync_single_for_device(, dir)
> > 
> >         dma_unmap_single(, dir)
> > 
> > In the case of a DMA-incoherent architecture, the operations done at each
> > stage depend on the direction argument:
> > 
> >         map             for_cpu         for_device      unmap
> > TO_DEV  writeback       none            writeback       none
> > TO_CPU  invalidate      invalidate*     invalidate      invalidate*
> > BIDIR   writeback       invalidate      writeback       invalidate
> > 
> > * - only necessary if the CPU speculatively prefetches.
> 
> I think invalidation of DMA buffer is required on for_cpu(TO_CPU) even
> if CPU doesn't preferch - what if we reuse the same buffer for multiple
> reads from DMA device?

That's fine - for non-coherent DMA, the CPU caches will only end up
containing data for that memory if:
- the CPU speculatively fetches data from that memory, or
- the CPU explicitly touches that memory

> > The multiple invalidations for the TO_CPU case handles different
> > conditions that can result in data corruption, and for some CPUs, all
> > four are necessary.
> 
> I would agree that map()/unmap() a quite a special cases and so depending
> on direction we need to execute in them either for_cpu() or for_device()
> call-backs depending on direction.
> 
> As for invalidation in case of for_device(TO_CPU) I still don't see
> a rationale behind it. Would be interesting to see a real example where
> we benefit from this.

Yes, you could avoid that, but depending how you structure the
architecture implementation, it can turn out to be a corner case.
The above table is precisely how 32-bit ARM is implemented, because
the way we implement them is based on who owns the memory - the "map"
and "for_device" operations translate internally to a cpu-to-device
ownership transition of the buffer.  Similar for "unmap" and "to_cpu".
It basically avoids having to add additional functions at the lower
implementation levels.

> > Things get more interesting if the implementation behind the DMA API has
> > to copy data between the buffer supplied to the mapping and some DMA
> > accessible buffer:
> > 
> >         map             for_cpu         for_device      unmap
> > TO_DEV  copy to dma     none            copy to dma     none
> > TO_CPU  none            copy to cpu     none            copy to cpu
> > BIDIR   copy to dma     copy to cpu     copy to dma     copy to cpu
> > 
> > So, in both cases, the value of the direction argument defines what you
> > need to do in each call.
> 
> Interesting enough in your seond table (which describes more complicated
> case indeed) you set "none" for for_device(TO_CPU) which looks logical
> to me.
> 
> So IMHO that's what make sense:
> ---------------------------->8-----------------------------
>         map             for_cpu         for_device      unmap
> TO_DEV  writeback       none            writeback       none
> TO_CPU  none            invalidate      none            invalidate*
> BIDIR   writeback       invalidate      writeback       invalidate*
> ---------------------------->8-----------------------------

That doesn't make sense for the TO_CPU case.  If the caches contain
dirty cache lines, and you're DMAing from the device to the system
RAM, other system activity can cause the dirty cache lines to be
evicted (written back) to memory which the DMA has already overwritten.
The result is data corruption.  So, you really can't have "none" in
the "map" case there.

Given that, the "for_cpu" case becomes dependent on whether the CPU
speculatively prefetches.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up