On 6/11/2015 11:13 PM, xingjing lu wrote:
> I am writing OpenMP programs under the GCC compiler, and I want to know the
> details about the overhead of GCC-OpenMP. My concerns are given below.
>
> 1) What is a good way to optimize my OpenMP program? There are many
> aspects that will affect the performance, such as load balancing, locality,
> scheduling overhead, synchronization, and so on. In which order should I
> check these aspects?
>
> 2) I want to know how to get the load balancing of my application under
> GCC-OpenMP. How do I instrument my application and the OpenMP runtime to
> extract the load balancing feature?
>
> 3) I guess OpenMP will spend some time on scheduling. What runtime APIs
> should I instrument to get the value of scheduling overhead?
>
> 4) Can I measure the time that an OpenMP program spends on synchronization,
> critical, lock and atomic operations?
>

You imply that gcc (on some unspecified OS) should supply facilities
equivalent to the combination of icc and Intel VTune. Note that the Intel
OpenMP library can be linked with a gcc Linux compilation, as it supports
all the gcc as well as the Intel OpenMP function calls.

Some of this can be inferred (with difficulty) by profiling with open
source tools like oprofile. I just tried profiling some gcc Windows OpenMP
code with Intel tools. Although they can quote the source line numbers
associated with parallel regions, they display disassembly rather than
source code (which VTune could display with Linux gcc). I have yet to check
whether libgomp is, or could be, built with the right debug symbol options
to get some source line references there. Evidently, those debug options
would depend on the target OS.

Without trying to answer your questions directly, I'll point out that poor
data locality may produce non-repeatable issues with load balancing etc.,
so you can't deal with those performance issues independently.
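Short of a profiler, one rough way to see load imbalance is to time each thread's share of a loop in the application itself. The following is only a sketch under assumptions: `work` and `measure_imbalance` are hypothetical names, the workload is artificial, and nothing here instruments libgomp itself. It relies only on the standard `omp_get_wtime`, `omp_get_thread_num` and `omp_get_num_threads` calls.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still builds without -fopenmp. */
#include <time.h>
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical workload: the cost of iteration i grows with i, so a
   static schedule hands the last thread far more work than the first. */
static double work(int i)
{
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += (double)k * 1e-9;
    return s;
}

/* Times each thread's share of the loop. busy[] must hold max_threads
   entries. Returns max(busy) / mean(busy); 1.0 means perfectly balanced. */
double measure_imbalance(int n, double busy[], int max_threads,
                         int *team_size, double *checksum)
{
    double total = 0.0;
    int nthreads = 1;

    assert(max_threads >= 1);
    for (int t = 0; t < max_threads; t++)
        busy[t] = 0.0;

    #pragma omp parallel num_threads(max_threads) reduction(+:total)
    {
        int tid = omp_get_thread_num();
        double t0 = omp_get_wtime();

        /* nowait: threads skip the loop's implicit barrier, so the
           interval below excludes time spent waiting for slower threads. */
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
            total += work(i);

        busy[tid] = omp_get_wtime() - t0;

        /* One thread records the team size actually granted. */
        #pragma omp single
        nthreads = omp_get_num_threads();
    }

    double max = 0.0, sum = 0.0;
    for (int t = 0; t < nthreads; t++) {
        if (busy[t] > max)
            max = busy[t];
        sum += busy[t];
    }
    *team_size = nthreads;
    *checksum = total;  /* keeps the compiler from discarding work() */
    return sum > 0.0 ? max / (sum / nthreads) : 1.0;
}
```

Built with `gcc -O2 -fopenmp`, a call such as `measure_imbalance(20000, busy, 8, &team, &sum)` gives one max/mean figure per run. Given the locality effects mentioned above, expect that figure to vary between runs, so compare several runs rather than trusting one.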
Dynamic scheduling to deal with load balancing is inherently somewhat
non-repeatable and difficult to analyze. You may notice that the Intel MKL
performance library appears to make a preliminary load balancing pass so as
to use a static schedule, but it then depends on effective settings of
OMP_PLACES, OMP_PROC_BIND, ...

-- 
Tim Prince