Nelson,

Thanks for your advice. I just figured out that the perceived lack of a
speedup was illusory: I was looking at the CPU time rather than the
wall-clock time, so that resolved my primary concern, but...

> (1) Fortran arrays are stored with the first subscript increasing most
> rapidly, the opposite of that used for C and C++. Reversing the loop
> order will make better use of cache.

This made a huge difference whether using OpenMP or not, thanks!

> (2) The second problem is the dimensions ("I've set nx and ny so large
> (1000 and 5000..."). To avoid cache conflicts, you want to choose the
> number of rows to be something other than a power of 2: a prime number
> is often a good choice. I have an example in my files of a program
> that ran about 3 times faster just by changing a row dimension from
> 256 (where there were cache collisions along the row) to 257 (where
> cache collisions are rare).

I would NEVER have figured this out, thanks. In the current application
the problem dictates the sizes of my arrays, so I can't really use the
tip, but I'll keep it in mind in the future.

> You should also check the generated assembly code (f77 -S foo.f)
> whether C(i,j)**2 is compiled into the inline code C(i,j)*C(i,j), or
> into call to the run-time library power function, and also whether the
> subscript address computations are eliminated.

I'll just inline it manually to be sure. I was trying to get a speedup
from OpenMP in that subroutine, not necessarily optimize overall.

Anand
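
P.S. For the archive, here is a minimal sketch of the three points above: the first subscript in the innermost loop, the square written out by hand, and wall-clock time measured alongside CPU time. It is not my actual subroutine. The module and function names, the sum-of-squares reduction, the assignment of 1000 to nx and 5000 to ny, and the free-form source are all assumptions made for illustration. Only the array name C and the two sizes come from the thread.

```fortran
! Hypothetical sketch, not the original subroutine. The names
! loop_order_sketch and sum_of_squares are illustrative assumptions.
module loop_order_sketch
  implicit none
contains
  function sum_of_squares(c, nx, ny) result(total)
    integer, intent(in) :: nx, ny
    double precision, intent(in) :: c(nx, ny)
    double precision :: total
    integer :: i, j

    total = 0.0d0
    ! Outer loop over the second subscript, inner loop over the first,
    ! so consecutive inner iterations touch adjacent memory locations.
    !$omp parallel do private(i) reduction(+:total)
    do j = 1, ny
      do i = 1, nx
        ! Explicit product in place of c(i,j)**2.
        total = total + c(i, j) * c(i, j)
      end do
    end do
    !$omp end parallel do
  end function sum_of_squares
end module loop_order_sketch

program demo
  use loop_order_sketch
  implicit none
  ! Assumed mapping of the sizes quoted in the thread.
  integer, parameter :: nx = 1000, ny = 5000
  double precision, allocatable :: c(:, :)
  double precision :: total
  integer :: t0, t1, rate
  real :: cpu0, cpu1

  allocate(c(nx, ny))
  c = 2.0d0

  call system_clock(count=t0, count_rate=rate)  ! wall-clock ticks
  call cpu_time(cpu0)                           ! processor time
  total = sum_of_squares(c, nx, ny)
  call cpu_time(cpu1)
  call system_clock(count=t1)

  ! Every element is 2, so the sum must be 4 * nx * ny = 2.0d7.
  if (abs(total - 2.0d7) > 1.0d-6) error stop 'unexpected sum'

  print *, 'sum                =', total
  print *, 'wall-clock seconds =', real(t1 - t0) / real(rate)
  print *, 'CPU seconds        =', cpu1 - cpu0
end program demo
```

With gfortran this should build with something like `gfortran -fopenmp sketch.f90`, where the file name is arbitrary. Without the flag the `!$omp` lines are treated as comments and the loop runs serially. On many implementations `cpu_time` adds up the time spent by all threads, so it can stay flat or even grow while the wall-clock time drops, which is the confusion I described at the top.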