On Wed, Feb 24, 2010 at 08:55:12AM -0800, David Daney wrote:

> It is possible that by choosing a better nudge_writes()
> implementation for R10K, that the 3% degradation could be erased.
> Perhaps:
>
> #define nudge_writes() do { } while (0)

raw_spin_unlock must provide a barrier, so this wouldn't be a valid
implementation of nudge_writes().  Implementing it as barrier(), which is
a pure compiler barrier, is the most liberal valid implementation.

> Basically you want something that is fast, but that also forces the
> write to be globally visible as soon as possible.  Some processors
> have a prefetch instruction that does this.  On other processors a
> NOP is optimal as they don't combine writes in the write back
> buffer.
>
> There is a wbflush() function that could potentially be used, but
> its implementation is too heavy on Octeon.

For IP27, which is a strongly ordered system, nudge_writes() is
implemented as barrier().

Another experiment I did was alignment.  A branch on an R10000 has a
significant execution time penalty if its delay slot overlaps a 128 byte
S-cache boundary.  Suitable alignment, however, did not seem to make any
difference at all on R10000.

  Ralf
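
To illustrate the barrier() variant discussed above, here is a minimal
sketch.  It assumes GCC-style inline asm; demo_lock and demo_unlock() are
hypothetical names made up for the example and are not the kernel's
spinlock code.

```c
/* Pure compiler barrier (GCC/Clang syntax): emits no instruction, but
 * the "memory" clobber stops the compiler from caching memory values in
 * registers or moving memory accesses across this point. */
#define barrier() __asm__ __volatile__("" : : : "memory")

/* The variant discussed above: nudge_writes() as a compiler barrier. */
#define nudge_writes() barrier()

/* Hypothetical toy lock, for illustration only. */
struct demo_lock {
	unsigned int locked;
};

/* Release the toy lock.  With an empty nudge_writes(), nothing in this
 * function would constrain how the compiler orders the release store
 * against the memory accesses that follow it. */
static void demo_unlock(struct demo_lock *lock)
{
	lock->locked = 0;	/* release store */
	nudge_writes();		/* compiler barrier after the store */
}
```

Note that barrier() only constrains the compiler.  It says nothing about
how quickly the hardware makes the store visible, which is the part a
prefetch or wbflush() style nudge would address.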