On 6/11/2015 11:13 PM, xingjing lu wrote:
> I am writing OpenMP programs with the GCC compiler, and I want to know
> the details of the overhead of GCC-OpenMP. My concerns are given below.
> 1) What is a good way to optimize my OpenMP program? There are many
> aspects that will affect the performance, such as load balancing, locality,
> scheduling overhead, synchronization, and so on. In which order should I
> check these aspects?
>
> 2) I want to know how to measure the load balancing of my application under
> GCC-OpenMP. How do I instrument my application and the OpenMP runtime to
> extract the load balancing feature?
>
> 3) I guess OpenMP will spend some time on scheduling. Which runtime APIs
> should I instrument to get the value of the scheduling overhead?
>
> 4) Can I measure the time that an OpenMP program spends on synchronization,
> critical, lock and atomic operations?
>

I guess something here caused my original reply to go out as something
other than plain text.

If you're looking for a simple way to get some of the facilities of the
Intel icc/VTune combination, remember that the Intel Linux OpenMP library
supports the gcc OpenMP function calls.

If you are on some other OS, like Windows, a profiler such as VTune or
oprofile will show you where time is spent in the OpenMP library's wait
loops, so you could attempt to infer answers to some of your questions
from that. libgomp for Windows doesn't appear to be built by default with
the debug symbols needed to capture source line numbers.

Poor locality could give rise to non-repeatable load imbalance. Scheduling
overhead is affected by chunk size (smaller chunks mean more trips to the
scheduler), and dynamic scheduling inherently gives non-repeatable results,
although it can compensate somewhat for imbalance. You may have to
experiment to see whether explicitly balancing the work among the parallel
for iterations does a better job with your application on your platform.
So it is unlikely that these performance issues can be dealt with one at a
time.

-- 
Tim Prince
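For questions 2 and 4, one low-tech approach that needs no runtime instrumentation is to time each thread's share of a worksharing loop yourself and compare the per-thread busy times. The C sketch below does this under a few stated assumptions. The workload `work`, and the helpers `measure_loop` and `imbalance_ratio`, are hypothetical names made up for illustration. The loop uses `schedule(runtime)`, so the schedule is taken from the standard `OMP_SCHEDULE` environment variable. `nowait` makes each thread stop its clock as soon as its own iterations finish instead of after the barrier. The max/mean ratio of those times is a rough load-imbalance figure, and the gap between a thread's time and the slowest thread's time approximates how long it sat in the barrier wait. This measures imbalance only; it does not isolate libgomp's internal scheduling cost. Serial fallbacks let it build without `-fopenmp`.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still compiles without -fopenmp. */
#include <time.h>
static int omp_get_max_threads(void) { return 1; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
static double omp_get_wtime(void) { return (double)clock() / CLOCKS_PER_SEC; }
#endif

/* Hypothetical workload: iteration i does O(i) work, so a plain
   schedule(static) split leaves the last thread with the most work. */
static double work(long i) {
    double s = 0.0;
    for (long k = 0; k < i; k++)
        s += 1.0 / (double)(k + 1);
    return s;
}

/* Load-imbalance ratio: maximum per-thread time divided by the mean.
   1.0 means perfectly balanced; larger values mean some threads waited
   at the barrier while others were still working. */
double imbalance_ratio(const double *busy, int n) {
    assert(n > 0);
    double max = busy[0], sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (busy[i] > max)
            max = busy[i];
        sum += busy[i];
    }
    double mean = sum / n;
    return mean > 0.0 ? max / mean : 1.0;
}

/* Runs an n-iteration loop, stores the imbalance ratio of the per-thread
   busy times in *ratio, and returns the loop result so the work is not
   optimized away. */
double measure_loop(long n, double *ratio) {
    int maxt = omp_get_max_threads();
    double *busy = calloc((size_t)maxt, sizeof *busy);
    int team = 1;
    double sum = 0.0;
    assert(busy != NULL);

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        /* Record the actual team size once; single ends with a barrier,
           so all threads start their clocks together afterwards. */
        #pragma omp single
        team = omp_get_num_threads();

        double t0 = omp_get_wtime();
        /* nowait: stop the clock when this thread's share is done,
           not after the implicit barrier at the end of the loop. */
        #pragma omp for schedule(runtime) nowait reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += work(i);
        busy[t] = omp_get_wtime() - t0;
    }   /* the barrier here makes sum and busy[] complete */

    *ratio = imbalance_ratio(busy, team);
    free(busy);
    return sum;
}
```

To compare schedules, you could call `measure_loop` from a small driver and run it once with `OMP_SCHEDULE=static` and once with `OMP_SCHEDULE="dynamic,16"` (the chunk size here is arbitrary). With this triangular workload, the static run should show a noticeably larger ratio. Because of the non-repeatability mentioned above, expect the numbers to vary from run to run.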