Re: New to PostgreSQL, performance considerations

"Merlin Moncure" <mmoncure@xxxxxxxxx> · Fri, 15 Dec 2006 10:55:14 -0500

On 12/15/06, Ron <rjpeace@xxxxxxxxxxxxx> wrote:
At 09:23 AM 12/15/2006, Merlin Moncure wrote:
>On 12/15/06, Ron <rjpeace@xxxxxxxxxxxxx> wrote:
>>
>>It seems unusual that code generation options which give access to
>>more registers would ever result in slower code...
>
>The slowdown is probably due to the unroll loops switch, which can
>actually hurt code due to the larger footprint (less cache coherency).

I have seen that effect as well occasionally in the last few decades
;-)  OTOH, suspicion is not _proof_; and I've seen other
"optimizations" turn out to be "pessimizations" over the years as well.

>The extra registers are not all that important because of pipelining
>and other hardware tricks.

No.  Whoever told you this or gave you such an impression was
mistaken.  There are many instances of x86 compatible code that get
30-40% speedups just because they get access to 16 rather than 8 GPRs
when recompiled for x86-64.

I'm not debating that this is true in specific cases.  Encryption and
video en/decoding have been shown to be faster in 64 bit mode on the
same architecture (a cursory search in google will confirm this).
However, 32 vs. 64 bit is not the same argument, since there are a lot
of other variables besides more registers.  64 bit mode is often slower
for many programs because of the extra code size from 64 bit pointers.
We benchmarked PostgreSQL internally here and found it to be fastest in
32 bit mode running on a 64 bit platform -- this was on a quad opteron
870 running our specific software stack; your results might be
different of course.
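
A rough sketch of the pointer-size effect, not taken from our
benchmark: the node layout below is hypothetical, and the 64-byte
cache line is an assumption about the hardware.  It only computes how
large a small struct holding two pointers becomes with 4-byte versus
8-byte pointers, using natural C alignment rules.

```python
def c_struct_size(field_sizes):
    """Size of a C struct whose fields are each aligned to their own size."""
    offset = 0
    max_align = 1
    for size in field_sizes:
        align = size
        # Pad up to the next multiple of this field's alignment.
        offset = (offset + align - 1) // align * align
        offset += size
        max_align = max(max_align, align)
    # Pad the tail so that arrays of the struct stay aligned.
    return (offset + max_align - 1) // max_align * max_align


def node_size(pointer_size):
    # Hypothetical node: int32 key, left pointer, right pointer, int32 flags.
    return c_struct_size([4, pointer_size, pointer_size, 4])


CACHE_LINE = 64  # bytes; an assumption, check your own CPU

for bits, pointer_size in ((32, 4), (64, 8)):
    size = node_size(pointer_size)
    print(f"{bits} bit: node is {size} bytes, "
          f"{CACHE_LINE // size} nodes per cache line")
```

With this layout the node grows from 16 to 32 bytes, so only half as
many fit in a cache line.  Real structures will vary.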

>Pretty much all the old assembly strategies such as forcing local
>variables to registers are basically obsolete...especially with
>regards to integer math.

Again, not true.  OTOH, humans are unlikely at this point to be able
to duplicate the accuracy of the compiler's register coloring
algorithms.  Especially on x86 compatibles.  (The flip side is that
_expert_ humans can often put the quirky register set and instruction
pipelines of x86 compatibles to more effective use for a specific
chunk of code than even the best compilers can.)

>As I said before, modern CPUs are essentially RISC engines with a
>CISC preprocessing engine laid on top.

I'm sure you meant modern =x86 compatible= CPUs are essentially RISC
engines with a CISC engine on top.  Just as "all the world's not a
VAX", "all CPUs are not x86 compatibles".  Forgetting this has
occasionally cost folks I know...

Yes, in fact I made this point earlier.

>Things are much more complicated than they were in the old days
>where you could count instructions for the assembly optimization process.

Those were the =very= old days in computer time...

>I suspect that there is little or no difference between
>-march=i686 and the various specific architectures.

There should be.  The FSF compiler folks (and the rest of the
industry compiler folks for that matter) are far from stupid.  They
are not just adding compiler switches because they randomly feel like it.

Evidence suggests that the most recent CPUs are in need of =more=
arch specific TLC compared to their ancestors, and that this trend is
not only going to continue, it is going to accelerate.

>Did anybody think to look at the binaries and look for the amount of
>differences?  I bet you code compiled for -march=opteron will run just
>fine on a pentium 2 if compiled for 32 bit.
Sucker bet given that the whole point of a 32b x86 compatible is to
be able to run code on any IA32 ISA CPU.
OTOH, I bet that code optimized for best performance on a P2 is not
getting best performance on a P4.  Or vice versa. ;-)

The big arch specific differences in Kx's are in 64b mode.  Not 32b.

I don't think so.  IMO all the processor specific instruction sets were
hacks of 32 bit mode to optimize specific tasks.  Except for certain
things these instructions are rarely, if ever, used in 64 bit mode,
especially in integer math (i.e. database binaries).  Since Intel's and
AMD's 64 bit modes are virtually identical, I submit that -march is not
really important anymore except for very, very specific (but
important) cases like spinlocks.  This thread is about how much
architecture dependent binaries can beat standard ones.  I say they
don't by very much at all, and with the specific exception of Daniel's
benchmarking the results posted to this list bear that out.
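
On the earlier question of how much the binaries actually differ,
something like the following would give a crude number.  It is only a
sketch: the file names in the usage comment are placeholders, and a raw
byte comparison says nothing about which differences matter for speed.

```python
import difflib


def byte_similarity(a: bytes, b: bytes) -> float:
    """Return a 0.0-1.0 similarity ratio between two byte strings."""
    # autojunk=False disables the "popular element" heuristic, which
    # would otherwise skew results on long inputs.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return matcher.ratio()


def compare_files(path_a: str, path_b: str) -> float:
    """Read two files and report how similar their contents are."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        return byte_similarity(fa.read(), fb.read())


# Usage (placeholder paths, one build per -march setting):
#   compare_files("postgres.generic", "postgres.opteron")
```

SequenceMatcher gets slow on large inputs, so for a full server binary
it is probably more practical to compare one object file or one
section at a time.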

merlin