On Thu, Jan 08, 2009 at 04:53:37PM -0800, Tim Prince wrote:
> Michael Meissner wrote:
> > On Wed, Jan 07, 2009 at 10:21:28AM -0500, Wirawan Purwanto wrote:
> >> Hi Michael,
> >>
> >> Thanks for the answer. I would like to know if someone has investigated
> >> this issue for some benchmark or real-world cases. Is there any
> >> write-up/report/paper on this thing?
> >
> > I suspect many people have done tests, but often have not published the
> > results. For example, when I worked for AMD, I sometimes did SPEC runs with
> > -mtune=generic, -mtune=athlon, -mtune=barcelona, or -mtune=core2 to see how the
> > tunings affected the real hardware. I recall that there were a few benchmarks
> > which saw noticeable differences (how integer to fp conversions are done was
> > one that I looked at for a bit).
> >
> -mtune=barcelona frequently speeds up vectorized loops on Core i7 by more
> than a factor of 2, compared with generic. On Core 2, of course, it's
> not clear cut: it speeds up more of my gfortran cases than it slows down,
> with the reverse being true of g++.
> There's not much mystery in this, as the major differences have to do with
> the alignment requirements of various CPU models.
> I thought integer to fp conversion would be more affected by -msse/sse2
> than by mtune.

There are about 4 different methods to convert int to float if the integer
value is in a GPR (direct GPR -> XMM conversion, Store -> Convert from memory,
Store -> Load -> Parallel convert, if memory serves). At a micro-level, AMD K8
is different from AMD Barcelona, which is different from Intel Core2, which is
different from Intel P4 (I imagine Intel i7 may be different as well). When you
get into benchmarks, some methods might be faster even if the optimization
guides say otherwise, because of the effect of writing a value from a GPR to
memory and reading the same value back into an XMM register.

-- 
Michael Meissner, IBM
4 Technology Place Drive, MS 2203A, Westford, MA, 01886, USA
meissner@xxxxxxxxxxxxxxxxxx
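
For anyone who wants to look at the int to float conversion discussed above on their own machine, here is a minimal C sketch. It is not from the thread: the function names ints_to_floats and int_to_float are hypothetical, and the code only provides something small to compile under different -mtune settings. Which conversion sequence the compiler actually picks depends on the GCC version and target, so no particular result is implied.

```c
#include <stddef.h>

/* Hypothetical helper (not from the thread): converts n ints to floats.
 * Each element needs one int -> float conversion; the instruction
 * sequence the compiler emits for it may vary with -mtune and -msse2. */
void ints_to_floats(const int *src, float *dst, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = (float)src[i];
}

/* Scalar case: on most x86 calling conventions the argument arrives in
 * a GPR or on the stack, so the compiler has to choose how to get the
 * value into a floating-point register before or while converting. */
float int_to_float(int x)
{
    return (float)x;
}
```

One way to compare tunings, assuming the file is saved as conv.c, is to generate assembly with something like "gcc -O2 -msse2 -mtune=generic -S conv.c -o generic.s" and again with -mtune=barcelona or -mtune=core2, then diff the two outputs. Timing on real hardware is still needed to see whether any difference matters.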