Re: how to analyze and measure the GCC OpenMP performance and overhead

Tim Prince <n8tm@xxxxxxx> · Fri, 12 Jun 2015 08:04:07 -0400

On 6/11/2015 11:13 PM, xingjing lu wrote:
>   I am writing OpenMP programs with the GCC compiler, and I want to know the
> details of the overhead of GCC-OpenMP. My concerns are given below.
> 1) What is a good way to optimize my OpenMP program? There are many
> aspects that will affect the performance, such as load balancing, locality,
> scheduling overhead, synchronization, and so on. In which order should I
> check these aspects?
>
> 2) I want to know how to measure the load balance of my application under
> GCC-OpenMP. How can I instrument my application and the OpenMP runtime to
> extract load-balance information?
>
> 3) I guess OpenMP will spend some time on scheduling. What runtime APIs
> should I instrument to measure the scheduling overhead?
>
> 4) Can I measure the time that an OpenMP program spends on synchronization,
> critical sections, locks and atomic operations?
>
>
You imply that gcc (on some unspecified OS) should supply facilities
equivalent to the combination of icc and Intel VTune.  Note that the
Intel OpenMP library can be linked with a gcc linux compilation, as it
supports all the gcc as well as Intel OpenMP function calls.
Some of what you ask for can be inferred (with difficulty) by profiling
with open source tools like oprofile.

I just tried profiling some gcc Windows OpenMP code with Intel tools.
Although they can quote the source line numbers associated with
parallel regions, they display disassembly rather than source code
(which VTune could show with linux gcc).  I have not yet checked
whether libgomp is or could be built with the right debug symbol options
to get some source line references there.  Evidently, those debug
options would depend on the target OS.

Without trying to answer your questions one by one, I'll point out that
poor data locality may produce non-repeatable issues with load balancing
and the like, so you can't deal with those performance issues
independently.  Dynamic scheduling to deal with load balancing is
inherently somewhat non-repeatable and difficult to analyze.  You may
notice that the Intel MKL performance library appears to make a
preliminary load balancing pass so as to use a static schedule, but it
then depends on effective setting of OMP_PLACES, OMP_PROC_BIND, ...
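
Short of a profiler, a rough picture of load balance can be had by
timing each thread by hand.  The following is only a sketch under my
own assumptions, not anything libgomp provides: work(), time_threads(),
imbalance() and MAX_THREADS are hypothetical names, and the triangular
workload is made up to provoke imbalance.  Only the omp_* calls and
schedule(runtime) are standard OpenMP.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the sketch still builds without -fopenmp. */
#include <time.h>
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 256  /* arbitrary cap chosen for this sketch */

/* Hypothetical workload: iteration i costs about i units, so equal-sized
   static chunks give the last thread far more work than the first. */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += 1e-9 * (double)k;
    return s;
}

/* Run the loop once and store each thread's time spent in the
   worksharing loop in busy[tid].  Returns the team size, capped at
   max_threads. */
int time_threads(int n, double *busy, int max_threads, double *total_out)
{
    int nthreads = 1;
    double total = 0.0;

    for (int t = 0; t < max_threads; t++)
        busy[t] = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();

        /* nowait: each thread stops its clock when its own share is done,
           not after waiting at the barrier for the slowest thread. */
        #pragma omp for schedule(runtime) nowait
        for (int i = 0; i < n; i++)
            total += work(i);

        double t1 = omp_get_wtime();
        if (tid < max_threads)
            busy[tid] = t1 - t0;
        if (tid == 0)
            nthreads = omp_get_num_threads();
    }

    *total_out = total;
    return nthreads < max_threads ? nthreads : max_threads;
}

/* Ratio of the slowest thread's time to the mean; 1.0 means even load. */
double imbalance(const double *busy, int nthreads)
{
    assert(nthreads > 0);
    double sum = 0.0, max = 0.0;
    for (int t = 0; t < nthreads; t++) {
        sum += busy[t];
        if (busy[t] > max)
            max = busy[t];
    }
    double mean = sum / nthreads;
    /* If the timer was too coarse to register anything, report "even". */
    return mean > 0.0 ? max / mean : 1.0;
}
```

Call time_threads() from your own main with a busy[MAX_THREADS] array,
build with gcc -O2 -fopenmp, and compare runs with OMP_SCHEDULE=static
against, say, OMP_SCHEDULE=dynamic,64, with and without
OMP_PROC_BIND=true.  Under a dynamic schedule the per-thread times also
include the cost of fetching chunks, so this does not separate
scheduling overhead from useful work, and the numbers should be
expected to vary from run to run.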

-- 
Tim Prince