Re: memcpy and prefetch

Ralf Baechle <ralf@xxxxxxxxxxxxxx> · Wed, 4 Feb 2009 21:27:46 +0000

On Thu, Jan 29, 2009 at 10:39:37PM -0500, David VomLehn (dvomlehn) wrote:

> > The idea here is that we have two issues with prefetching:
> > 
> >  o Prefetching beyond the end of the source or destination range on a
> >    non-coherent range might bring back stale values from a DMA I/O
> >    buffer resulting in data corruption.  Hardware DMA coherency will
> >    avoid this issue.
> > 
> >  o IP27 has full blown hardware coherency.  Historically
> >    CONFIG_DMA_COHERENT was not able to cope with something of the
> >    complexity of IP27, so there was a separate CONFIG_DMA_IP27 and the
> >    broken logic expression was meant to treat CONFIG_DMA_COHERENT and
> >    CONFIG_DMA_IP27 the same as for prefetching.
> > 
> >  o Prefetching beyond the end of physical memory can cause exceptions
> >    on some systems.  The Malta has this problem.
> > 
> > Thus no prefetching on Malta or non-coherent systems.

> It seems to me as though we could avoid the first and third problems
> with a memcpy that doesn't prefetch past the end of the buffer, the
> thought being that if we are reading or writing a memory region, we
> really shouldn't be doing DMA to or from that location. This would
> probably be slightly suboptimal, performance-wise, for those systems
> that do have DMA coherence. It seems as though we could have two
> mutually exclusive versions, selectable via the CONFIG_DMA_COHERENT
> flag. For those of us without DMA coherence, it would probably give our
> memcpy performance a bit of a kick in the pants over using no prefetch
> at all.

Unnecessary prefetching can come at a high cost due to memory latencies
and cache pollution.  So you want to avoid unnecessary prefetches rather
than hoping for hardware cache coherency to sort out the mess software
left behind.
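
As a rough illustration of keeping prefetches inside the buffer, here is
a minimal C sketch.  It is not the kernel's memcpy: the function name,
the 32-byte line size and the prefetch distance are all assumptions made
up for the example, and it relies on the GCC/Clang __builtin_prefetch
builtin.  Only the source side is prefetched.

```c
#include <stddef.h>
#include <string.h>

#define LINE       32           /* assumed cache line size in bytes */
#define PREF_AHEAD (4 * LINE)   /* assumed prefetch distance */

/* Hypothetical helper for illustration only, not the kernel's memcpy. */
void *copy_bounded_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (n - off >= LINE) {
        /*
         * Prefetch only when the LINE bytes starting at the prefetch
         * address still lie inside the source buffer, so no prefetch
         * ever touches a line beyond the end of the source.
         */
        if (n - off >= PREF_AHEAD + LINE)
            __builtin_prefetch(s + off + PREF_AHEAD, 0, 0);

        memcpy(d + off, s + off, LINE);   /* copy one line-sized chunk */
        off += LINE;
    }

    memcpy(d + off, s + off, n - off);    /* tail, no prefetch at all */
    return dst;
}
```

The price is one extra compare per chunk; whether that beats no
prefetching at all depends on the core and has to be measured.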

The general expectation is that prefetching will help - but depending on
the pipeline structure prefetching can be hard to exploit optimally.  For
example there are MIPS cores where the optimal sequence is something like

  load store load store load store load store

But on others it's

  load load load load store store store store

Placing prefetching instructions into loops built from such blocks can
result in very surprising results.
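
To make the two orderings concrete, here is a sketch in C of a four-word
block in each style.  The function names are invented for the example and
the buffers are assumed not to overlap.  A compiler is free to reschedule
these, so a real implementation would be written in assembler; this only
shows the shape of the blocks.

```c
#include <stddef.h>

/* load store load store ...: each word is stored right after its load. */
void copy4_interleaved(unsigned long *d, const unsigned long *s)
{
    unsigned long t;

    t = s[0]; d[0] = t;
    t = s[1]; d[1] = t;
    t = s[2]; d[2] = t;
    t = s[3]; d[3] = t;
}

/* load load load load store store store store: all loads issue first. */
void copy4_grouped(unsigned long *d, const unsigned long *s)
{
    unsigned long t0 = s[0];
    unsigned long t1 = s[1];
    unsigned long t2 = s[2];
    unsigned long t3 = s[3];

    d[0] = t0;
    d[1] = t1;
    d[2] = t2;
    d[3] = t3;
}
```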

> If this makes sense, we might be able to sign up to do the work. Anyone
> have a good, caching-aware memcpy test?

Testing memcpy is an interesting little project.  Correctness is one
thing but a good implementation needs to make a few performance tradeoffs
which are best measured with real-world, not synthetic, workloads.
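
For the correctness half, a sketch along these lines would be a starting
point.  It sweeps source and destination alignments and lengths and
checks guard bytes around the destination.  The function name and the
limits are assumptions picked for the example, and it says nothing about
performance.

```c
#include <stddef.h>
#include <string.h>

#define MAX_LEN   96    /* assumed longest copy to test */
#define MAX_ALIGN 8     /* assumed number of alignments to sweep */
#define GUARD     16    /* guard bytes on each side of the buffers */
#define FILL      0xA5  /* value dst holds before each copy */

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical checker: returns the number of failing combinations. */
int count_copy_failures(copy_fn fn)
{
    unsigned char src[GUARD + MAX_ALIGN + MAX_LEN + GUARD];
    unsigned char dst[sizeof src];
    int failures = 0;

    for (size_t sa = 0; sa < MAX_ALIGN; sa++)
    for (size_t da = 0; da < MAX_ALIGN; da++)
    for (size_t len = 0; len <= MAX_LEN; len++) {
        /* src bytes are 0..135, so none of them equals FILL. */
        for (size_t i = 0; i < sizeof src; i++) {
            src[i] = (unsigned char)i;
            dst[i] = FILL;
        }

        unsigned char *d = dst + GUARD + da;
        const unsigned char *s = src + GUARD + sa;
        fn(d, s, len);

        /* The copied range must match the source ... */
        int bad = memcmp(d, s, len) != 0;

        /* ... and every byte outside it must be untouched. */
        for (size_t i = 0; i < sizeof dst; i++) {
            int inside = i >= GUARD + da && i < GUARD + da + len;
            if (!inside && dst[i] != FILL)
                bad = 1;
        }
        failures += bad;
    }
    return failures;
}
```

Calling count_copy_failures(memcpy) should return 0 for a working
implementation.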

  Ralf