Hi,

it looks reasonable if you are using only a single core to do the test.

The interconnect bandwidth is usually lower than the memory controller bandwidth, so there is always this bottleneck. With many/most architectures (especially Nehalem / Westmere) you cannot saturate the memory bandwidth with only a single core, so you need to run the test with multiple or, better, with all the cores.

On my machine I see a bandwidth delta of around 35 % when accessing memory on node 1 while running the application on node 0.

This is what I've tried out:

wget www.cs.virginia.edu/stream/FTP/Code/stream.c
gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=134217728 -DNTIMES=20 stream.c -o stream.1GiB.20

#########################################################
numactl --cpunodebind=0 --membind=0 ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 104100 microseconds.
   (= 104100 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           14258.8     0.150689     0.150608     0.150890
Scale:          14255.4     0.150742     0.150643     0.150935
Add:            15889.5     0.202965     0.202727     0.203927
Triad:          15904.0     0.202771     0.202542     0.203169
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

#########################################################
numactl --cpunodebind=0 --membind=1 ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 146572 microseconds.
   (= 146572 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            9763.3     0.220143     0.219955     0.220324
Scale:           9758.8     0.220277     0.220056     0.220557
Add:            10442.6     0.308598     0.308469     0.308825
Triad:          10453.1     0.308435     0.308159     0.308759
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

It can get even worse if the local and the remote CPUs are accessing memory on a single node. I get around 50 % of the bandwidth of the run that uses the memory of both nodes (the last run below), which makes sense when using 1 instead of 2 memory controllers.

numactl --cpunodebind=0,1 --membind=1 ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 121801 microseconds.
   (= 121801 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           12337.7     0.174139     0.174059     0.174325
Scale:          12296.9     0.174744     0.174637     0.174885
Add:            14211.8     0.226803     0.226659     0.226928
Triad:          14251.7     0.226201     0.226024     0.226418
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

#########################################################
numactl --cpunodebind=0,1 --membind=0,1 ./stream.1GiB.20
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 40
Number of Threads counted = 40
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 56900 microseconds.
   (= 56900 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           27330.0     0.078743     0.078576     0.079084
Scale:          27239.9     0.079228     0.078836     0.080867
Add:            30466.9     0.105910     0.105729     0.106404
Triad:          30504.9     0.105749     0.105597     0.106209
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Regards,
Andreas

2014-11-25 16:11 GMT+01:00 Serge A <serge.ayoun@xxxxxxxxx>:
> I am new to this list and hope this is the right place to ask this question.
>
> We are on process to check performance of a new hardware feature on NUMA
> system.
>
> I did a few experiments to check memory performance behavior on a two node
> NUMA system with the following SLIT:
>
> 10 21
> 21 10
>
> The test does the following: allocate 1 GB buffer and access the buffer
> sequentially with a 256bytes stride, i.e. accessing
> Offset 0, 256, 512, …..
>
> This test was performed while running on node 0 (using numa_run_on_node()
> API) done while accessing local node memory 0 and also remote node 1 memory
> (using(numa_alloc_onnode()) API).
>
> Surprisingly, the performance delta is about 15%: local memory test took
> 1800ms while the remote one took 2100ms. I expected much more.
>
> Two questions:
> Do these results look reasonable to you?
> Is there some kind of recommended and standard memory test for NUMA system?
>
> Thanks,
> Serge
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-numa" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
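
As a quick sanity check of the "around 35 %" and "around 50 %" figures mentioned in this mail, the Triad rates of the four STREAM runs can be compared with a few lines of Python. This is only a rough sketch: the dictionary keys and the helper name percent_drop are made up for illustration, and the numbers are simply copied from the STREAM output above.

```python
# Triad "Best Rate MB/s" values copied from the four STREAM runs in this mail.
# The key names are hypothetical labels, not something numactl or STREAM prints.
triad_mb_s = {
    "cpu0_mem0": 15904.0,    # numactl --cpunodebind=0   --membind=0
    "cpu0_mem1": 10453.1,    # numactl --cpunodebind=0   --membind=1
    "cpu01_mem1": 14251.7,   # numactl --cpunodebind=0,1 --membind=1
    "cpu01_mem01": 30504.9,  # numactl --cpunodebind=0,1 --membind=0,1
}


def percent_drop(baseline, measured):
    """Return how much lower `measured` is than `baseline`, in percent."""
    return (baseline - measured) / baseline * 100.0


# Node 0 CPUs reading remote (node 1) memory vs. local (node 0) memory.
remote_drop = percent_drop(triad_mb_s["cpu0_mem0"], triad_mb_s["cpu0_mem1"])

# Both CPU nodes sharing node 1's memory, as a share of the two-node run.
one_node_share = triad_mb_s["cpu01_mem1"] / triad_mb_s["cpu01_mem01"] * 100.0

print(f"remote vs. local memory, node 0 CPUs: {remote_drop:.1f} % lower")
print(f"both CPU nodes on node 1 memory: {one_node_share:.1f} % of the two-node rate")
```

With the Triad numbers above this gives roughly 34 % lower bandwidth for the remote run and roughly 47 % of the two-node rate when all threads share one node's memory.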