While trying to understand the prefetch code in memcpy, it looks to me
like it's prefetching too far ahead of the loop. In the main aligned loop
the loop copies 32 or 64 bytes of data and the prefetch is trying to
prefetch 256 bytes ahead of the current copy. The prefetches should also
pay attention to the cache line size, and they currently don't. If the
line size is less than the copy size, we are skipping prefetches that
should be done. For the 4Kc the line size is only 16 bytes, so we should
be doing a prefetch for each line. The src_unaligned_dst_aligned loop is even
worse as it prefetches 288 bytes ahead of the copy and only copies 16 or
32 bytes at a time.
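
As a rough illustration of the per-line prefetching described above, here
is a minimal C sketch. It is not the actual memcpy code. LINE_SIZE, CHUNK,
PREF_AHEAD and both function names are hypothetical values and names picked
for the example, and __builtin_prefetch is the GCC/Clang builtin standing
in for the pref instruction.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical values for illustration only; not taken from the real code. */
#define LINE_SIZE   16  /* cache line size in bytes (16 on the 4Kc) */
#define CHUNK       32  /* bytes copied per loop iteration */
#define PREF_AHEAD  64  /* prefetch distance ahead of the current copy */

/* Number of prefetches one iteration needs so that no line is skipped. */
static size_t prefetches_per_chunk(size_t chunk, size_t line)
{
    return (chunk + line - 1) / line;
}

/* Copy n bytes, issuing one read prefetch per cache line of the chunk
 * that lies PREF_AHEAD bytes ahead of the current source position. */
static void *copy_with_prefetch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= CHUNK) {
        for (size_t off = 0; off < CHUNK; off += LINE_SIZE) {
            /* Stay inside the source buffer when forming the address. */
            if (PREF_AHEAD + off < n)
                __builtin_prefetch(s + PREF_AHEAD + off, 0);
        }
        memcpy(d, s, CHUNK);
        d += CHUNK;
        s += CHUNK;
        n -= CHUNK;
    }
    memcpy(d, s, n);  /* remaining tail, fewer than CHUNK bytes */
    return dst;
}
```

With a 16-byte line, a 32-byte chunk needs 2 prefetches per iteration and
a 64-byte chunk needs 4. A single prefetch per iteration would leave the
other lines to miss in the cache.
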
Have I totally misunderstood the code?
Greg Weeks