Re: Some performance figure questions.

Hi,

It looks reasonable if you are using only a single core for the test.

The interconnect bandwidth is usually lower than the memory
controller bandwidth, so there is always this bottleneck.

On many architectures (especially Nehalem / Westmere) you cannot
saturate the full memory bandwidth with a single core, so you need
to run the test with multiple threads, or better with all cores.
Since stream.c is built with -fopenmp below, the thread count can be
set via the OMP_NUM_THREADS environment variable.

On my machine I see a bandwidth delta of around 35 % when running
the application on node 0 and accessing memory on node 1 (compare
the Triad rates below: 15904.0 MB/s local vs. 10453.1 MB/s remote,
about 34 % lower).

This is what I've tried out:

wget www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=134217728 \
    -DNTIMES=20 stream.c -o stream.1GiB.20

#########################################################

numactl --cpunodebind=0 --membind=0  ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 104100 microseconds.
   (= 104100 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           14258.8     0.150689     0.150608     0.150890
Scale:          14255.4     0.150742     0.150643     0.150935
Add:            15889.5     0.202965     0.202727     0.203927
Triad:          15904.0     0.202771     0.202542     0.203169
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

#########################################################

numactl --cpunodebind=0 --membind=1  ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 146572 microseconds.
   (= 146572 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9763.3     0.220143     0.219955     0.220324
Scale:           9758.8     0.220277     0.220056     0.220557
Add:            10442.6     0.308598     0.308469     0.308825
Triad:          10453.1     0.308435     0.308159     0.308759
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------


It can get even worse if both the local and the remote CPUs are
accessing memory on a single node. I get around 50 % of the
bandwidth, which makes sense, since everything then goes through one
memory controller instead of two.

numactl --cpunodebind=0,1 --membind=1  ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 121801 microseconds.
   (= 121801 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           12337.7     0.174139     0.174059     0.174325
Scale:          12296.9     0.174744     0.174637     0.174885
Add:            14211.8     0.226803     0.226659     0.226928
Triad:          14251.7     0.226201     0.226024     0.226418
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

#########################################################

numactl --cpunodebind=0,1 --membind=0,1  ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 56900 microseconds.
   (= 56900 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           27330.0     0.078743     0.078576     0.079084
Scale:          27239.9     0.079228     0.078836     0.080867
Add:            30466.9     0.105910     0.105729     0.106404
Triad:          30504.9     0.105749     0.105597     0.106209
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
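
#########################################################

For comparison, here is a minimal sketch of the strided test you
describe in your mail below, using the numa_run_on_node() and
numa_alloc_onnode() APIs you mentioned. The 1 GiB buffer and the
256-byte stride are taken from your description; the read loop, the
timing and the command-line node arguments are my assumptions, so
treat it as a sketch rather than a reference implementation. Build
with: gcc stride.c -o stride -lnuma

/* stride.c - minimal sketch of a strided NUMA read test.
 * Usage: ./stride <cpu_node> <mem_node>                  */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (1UL << 30)   /* 1 GiB buffer, as in your test  */
#define STRIDE   256           /* touch offsets 0, 256, 512, ... */

int main(int argc, char **argv)
{
        int cpu_node = (argc > 1) ? atoi(argv[1]) : 0;
        int mem_node = (argc > 2) ? atoi(argv[2]) : 0;

        if (numa_available() < 0) {
                fprintf(stderr, "NUMA not available\n");
                return 1;
        }

        /* Pin the thread to one node, allocate on another (or the
         * same) node. */
        numa_run_on_node(cpu_node);
        volatile char *buf = numa_alloc_onnode(BUF_SIZE, mem_node);
        if (!buf) {
                perror("numa_alloc_onnode");
                return 1;
        }
        memset((void *)buf, 1, BUF_SIZE);   /* fault the pages in */

        struct timespec t0, t1;
        unsigned long sum = 0;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t off = 0; off < BUF_SIZE; off += STRIDE)
                sum += buf[off];            /* one read per stride */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        printf("cpu node %d, mem node %d: %.1f ms (sum %lu)\n",
               cpu_node, mem_node, ms, sum);

        numa_free((void *)buf, BUF_SIZE);
        return 0;
}

Running ./stride 0 0 vs. ./stride 0 1 should then show the same kind
of local/remote delta you measured with your own test.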

Regards,
Andreas

2014-11-25 16:11 GMT+01:00 Serge A <serge.ayoun@xxxxxxxxx>:
> I am new to this list and hope this is the right place to ask this question.
>
> We are in the process of checking the performance of a new hardware
> feature on a NUMA system.
>
> I did a few experiments to check memory performance behavior on a
> two-node NUMA system with the following SLIT:
>
> 10 21
> 21 10
>
> The test does the following: allocate a 1 GB buffer and access it
> sequentially with a 256-byte stride, i.e. touching offsets 0, 256,
> 512, and so on.
>
> The test was run on node 0 (using the numa_run_on_node() API), once
> accessing local node 0 memory and once accessing remote node 1
> memory (using the numa_alloc_onnode() API).
>
> Surprisingly, the performance delta is only about 15 %: the local
> memory test took 1800 ms while the remote one took 2100 ms. I
> expected much more.
>
> Two questions:
>
> Do these results look reasonable to you?
>
> Is there some kind of recommended, standard memory test for NUMA
> systems?
>
> Thanks,
>
> Serge
>



