On 02/25/2010 06:15 AM, Ralf Baechle wrote:
On Wed, Feb 24, 2010 at 08:55:12AM -0800, David Daney wrote:
It is possible that choosing a better nudge_writes() implementation
for R10K would erase the 3% degradation.
Perhaps:
#define nudge_writes() do { } while (0)
raw_spin_unlock must provide a barrier so this wouldn't be a valid
implementation for nudge_writes().
That barrier is separate (and present). The sole purpose of
nudge_writes() is to speed up the global visibility of the
releasing write; it does not have anything to do with locking semantics.
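
To illustrate the separation described above, here is a minimal user-space
sketch, not the kernel's actual raw_spin_unlock. The names sketch_lock,
sketch_lock_acquire and sketch_unlock are hypothetical, C11 atomics stand in
for the kernel primitives, and the GCC/Clang inline-asm form of a compiler
barrier is assumed for nudge_writes():

```c
#include <stdatomic.h>

/* Assumed stand-in for nudge_writes(): a pure compiler barrier
 * (GCC/Clang syntax). It is a hint only and carries no locking semantics. */
#define nudge_writes() __asm__ __volatile__("" ::: "memory")

/* Hypothetical lock type for illustration only. */
struct sketch_lock {
    atomic_int locked;
};

static inline void sketch_lock_acquire(struct sketch_lock *l)
{
    int expected = 0;

    /* Spin until we swap 0 -> 1 with acquire ordering. */
    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

static inline void sketch_unlock(struct sketch_lock *l)
{
    /* The barrier that unlock must provide: a release store. */
    atomic_store_explicit(&l->locked, 0, memory_order_release);

    /* Separate step: encourage the releasing write to become
     * globally visible sooner. Correctness does not depend on it. */
    nudge_writes();
}
```

Removing the nudge_writes() call from sketch_unlock() would leave the
ordering guarantees of this sketch unchanged; only the latency seen by a
waiting CPU could differ.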
Implementing it as barrier(), which is a pure compiler barrier, is the
most liberal valid implementation.
No, the most liberal would be a true NOP: 'do { } while (0)'.
Basically you want something that is fast, but that also forces the
write to be globally visible as soon as possible. Some processors
have a prefetch instruction that does this. On other processors a
NOP is optimal as they don't combine writes in the write back
buffer.
There is a wbflush() function that could potentially be used, but
its implementation is too heavy on Octeon.
For IP27 which is a strongly ordered system nudge_writes() is implemented
as barrier().
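
The two candidates discussed in this thread can be put side by side in a
rough sketch. NUDGE_KIND and its values are hypothetical selectors invented
for this example, and the compiler barrier uses GCC/Clang inline-asm syntax.
The prefetch-based variant is left out because the thread does not name the
instruction:

```c
/* Hypothetical selector, not a real kernel config symbol. */
#define NUDGE_NOP              0
#define NUDGE_COMPILER_BARRIER 1

#ifndef NUDGE_KIND
#define NUDGE_KIND NUDGE_COMPILER_BARRIER
#endif

#if NUDGE_KIND == NUDGE_NOP
/* True NOP: suitable where the write buffer does not combine writes. */
#define nudge_writes() do { } while (0)
#else
/* Pure compiler barrier, as described for IP27. */
#define nudge_writes() __asm__ __volatile__("" ::: "memory")
#endif

static int shared_flag;

/* Store a value, then apply whichever nudge was selected. */
static void publish(int v)
{
    shared_flag = v;
    nudge_writes();
}
```

Building with -DNUDGE_KIND=0 selects the NOP variant, so the two can be
compared on the same benchmark without touching the callers.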
Another experiment I did was alignment. A branch on an R10000 has a
significant execution time penalty if its delay slot is overlapping a
128-byte S-cache boundary. Suitable alignment, however, did not seem
to make any difference at all on R10000.
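
As a rough illustration of the alignment condition, assuming it means that
the branch and its delay slot fall in different 128-byte S-cache lines, and
given that MIPS instructions are 4 bytes with the delay slot directly after
the branch, the check reduces to simple address arithmetic. The helper name
delay_slot_crosses_line is hypothetical:

```c
#include <stdint.h>

#define SCACHE_LINE_BYTES 128u /* S-cache boundary size from the text */
#define MIPS_INSN_BYTES   4u   /* fixed-width MIPS instruction */

/* Returns 1 if the delay slot of a branch at branch_addr lies in a
 * different 128-byte line than the branch itself, otherwise 0. */
static int delay_slot_crosses_line(uintptr_t branch_addr)
{
    uintptr_t slot_addr = branch_addr + MIPS_INSN_BYTES;

    return (branch_addr / SCACHE_LINE_BYTES) !=
           (slot_addr / SCACHE_LINE_BYTES);
}
```

Under that reading, only a branch in the last instruction slot of a line
(offset 124 within the line) is affected.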
Ralf