On 05.09.2015 23:30, Helge Deller wrote:
> Hi James,
> ...
> I haven't done any performance measurements yet, but your patch looks
> absolutely correct.
> ...

Hello everyone,

I did some timing tests with the various patches for
a) the atomic_hash patch: https://patchwork.kernel.org/patch/7116811/
b) the alignment of the LWS locks: https://patchwork.kernel.org/patch/7137931/

The testcase I used is basically the following:
- It starts 32 threads.
- We have 16 atomic ints organized in an array.
- The first thread increments the first atomic int NITERS times.
- The second thread decrements the first atomic int NITERS times.
- The third/fourth thread increments/decrements the second atomic int,
  and so on...
- So we have 32 threads, of which 16 increment and 16 decrement 16
  different atomic ints.
- All threads run in parallel on a 4-way SMP PA8700 rp5470 machine.
- I used the "time" command to measure the timings.
- I did not stop other services on the machine, but I ran each test a
  few times and the timings did not vary significantly between runs.
- All timings were done on a vanilla 4.2 kernel with only the mentioned
  patch applied.

The code is a modified testcase from the libatomic-ops Debian package:

#include "atomic_ops.h"

AO_t counter_array[16] = { 0, };
#define NITERS 1000000

void * add1sub1_thr(void * id)
{
  int me = (int)(AO_PTRDIFF_T)id;
  AO_t *counter;
  int i;

  /* Threads 2k and 2k+1 share counter_array[k]. */
  counter = &counter_array[me >> 1];
  for (i = 0; i < NITERS; ++i)
    if ((me & 1) != 0) {
      (void)AO_fetch_and_sub1(counter);  /* odd thread ids decrement */
    } else {
      (void)AO_fetch_and_add1(counter);  /* even thread ids increment */
    }
  return 0;
}

...
run_parallel(32, add1sub1_thr)
...

The baseline for all results is the timing with a vanilla kernel 4.2:

real    0m13.596s
user    0m18.152s
sys     0m35.752s

The next results are with the atomic_hash patch (a) applied:

ATOMIC_HASH_SIZE = 4:
real    0m21.892s
user    0m27.492s
sys     0m59.704s

ATOMIC_HASH_SIZE = 64:
real    0m20.604s
user    0m24.832s
sys     0m56.552s

Next I applied only the LWS locks patch (b):

LWS_LOCK_ALIGN_BITS = 4:
real    0m13.660s
user    0m18.592s
sys     0m35.236s

LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
real    0m11.992s
user    0m19.064s
sys     0m28.476s

Then I applied both patches (a and b):

ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = 4:
real    0m13.232s
user    0m17.704s
sys     0m33.884s

ATOMIC_HASH_SIZE = 64, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
real    0m12.300s
user    0m20.268s
sys     0m28.424s

ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = 4:
real    0m13.181s
user    0m17.584s
sys     0m34.800s

ATOMIC_HASH_SIZE = 4, LWS_LOCK_ALIGN_BITS = L1_CACHE_SHIFT:
real    0m11.692s
user    0m18.232s
sys     0m27.072s

In summary, I'm astonished by these results. From patch (a) in
particular I would have expected the same or better performance when
applied stand-alone, because it makes the spinlocks more fine-grained;
instead it slows the testcase down by roughly 60% in real time, which
is strange. Patch (b) stand-alone significantly increases performance
(real time drops ~12%, sys time ~20%), and together with patch (a) it
adds a few more percent on top.

Given the numbers above, I would currently suggest applying both
patches (with ATOMIC_HASH_SIZE = 4 and LWS_LOCK_ALIGN_BITS =
L1_CACHE_SHIFT).

Thoughts?
Helge
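
P.S. For anyone who hasn't looked at the patches themselves, below is a
rough user-space sketch of the two knobs being tuned above. It is not
the kernel code: HASH_SIZE and LOCK_ALIGN_BITS stand in for
ATOMIC_HASH_SIZE and LWS_LOCK_ALIGN_BITS, and a pthread mutex stands in
for the kernel spinlock. The idea is the same, though: an atomic
operation on an address is serialized by one lock out of a small hashed
array, and each lock can be padded out to a full cache line
(1 << L1_CACHE_SHIFT bytes; 64 on PA 2.0 machines like this one, if I
remember correctly) so that two locks never share a line.

#include <pthread.h>
#include <stdio.h>

#define LOCK_ALIGN_BITS 6            /* like L1_CACHE_SHIFT: one lock per line */
#define LOCK_ALIGN      (1UL << LOCK_ALIGN_BITS)
#define HASH_SIZE       4            /* like ATOMIC_HASH_SIZE: number of locks */

/* Pad each lock to LOCK_ALIGN bytes. With LOCK_ALIGN_BITS = 4 several
 * locks would share one cache line and ping-pong between the CPUs. */
struct padded_lock {
  pthread_mutex_t mutex;
} __attribute__((aligned(LOCK_ALIGN)));

static struct padded_lock lock_hash[HASH_SIZE] = {
  [0 ... HASH_SIZE - 1] = { PTHREAD_MUTEX_INITIALIZER }  /* GNU C range init */
};

/* Map an address onto one of the locks: drop the in-cache-line bits,
 * then mask into the table, similar to what ATOMIC_HASH() does. */
static pthread_mutex_t *hash_lock(const volatile void *addr)
{
  unsigned long a = (unsigned long)addr;

  return &lock_hash[(a >> LOCK_ALIGN_BITS) & (HASH_SIZE - 1)].mutex;
}

/* Emulated fetch-and-add, serialized by the hashed lock. */
static long fetch_and_add(volatile long *p, long n)
{
  pthread_mutex_t *m = hash_lock(p);
  long old;

  pthread_mutex_lock(m);
  old = *p;
  *p = old + n;
  pthread_mutex_unlock(m);
  return old;
}

int main(void)
{
  volatile long counter = 0;

  (void)fetch_and_add(&counter, 1);
  printf("counter = %ld\n", counter);
  return 0;
}

Note that with this kind of hashing, addresses within the same cache
line always map to the same lock, so the 16 adjacent counters in the
testcase above end up spread over only one or two of the hashed locks
regardless of the hash size.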