On Tue, 25 Mar 2014, David Brown wrote:
On 25/03/14 04:31, Xinrong Fu wrote:
Hi guys:
What does the number of stalled cycles in the CPU pipeline frontend
mean? Why does a 32-bit program show more stalled frontend cycles than
a 64-bit program when both run on the same 64-bit system?
Are there any gcc options to fix it?
If the question is why a 32-bit program compiled
as 64-bit is a lot slower:
there can be several reasons, so let's name a few.
a) For example, if you use signed 32-bit indexing, as in
int i, x, array[64];
i = ...;
x = array[i];
this goes very fast on a 32-bit processor in 32-bit mode, yet a lot slower
in 64-bit mode, as i needs a sign extension to 64 bits before it can be
used as an index. So the compiler generates one additional instruction in
64-bit mode to sign-extend i from 32 to 64 bits.
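A common workaround, sketched below (this snippet is mine, purely
illustrative, and the function names are made up), is to use a
pointer-sized index type such as size_t, so there is nothing left to
sign-extend:

  #include <stddef.h>

  int array[64];

  /* signed 32-bit index: on x86-64 the compiler typically emits an
     extra sign-extension (movsxd) before the indexed load */
  int load_int_index(int i)
  {
      return array[i];
  }

  /* pointer-sized index: already 64 bits wide, no extension needed */
  int load_size_t_index(size_t i)
  {
      return array[i];
  }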
b) Some processors can 'issue' more 32-bit instructions per clock than
64-bit instructions. This can have many reasons; for example, the processor
can only decode a limited number of bytes per clock, and since 32-bit
instructions occupy less space (the 64-bit forms usually carry an extra
REX prefix byte), it might decode four 32-bit instructions per clock yet
only three 64-bit ones. Please note: I'm not taking vector instructions
into account here, just counting each instruction as one instruction,
regardless of how wide the register is that it operates on.
Agner Fog has more exact measurements of how few bytes modern
processors can actually decode per clock.
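To make the size difference concrete (these are the standard x86-64
encodings, easy to verify with any assembler):

  add eax, ebx     ; 01 D8      (2 bytes, 32-bit operands)
  add rax, rbx     ; 48 01 D8   (3 bytes: REX.W prefix + same opcode)

With a decoder that looks at, say, 16 bytes per clock, those extra prefix
bytes directly reduce how many instructions fit into one decode window.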
My chess program Diep, which is deterministic integer code (so no
vector code), is about 10%-12% slower compiled as 64-bit than as 32-bit.
This even though it does use a few 64-bit datatypes (very few, though).
In 64-bit mode the data size doesn't grow, but instruction-wise the code
grows immensely, of course.
Besides the above reasons, another reason why 32-bit programs compiled as
64-bit can be a lot slower, in the case of Diep, is:
c) the larger code size causes more L1 instruction cache misses.
And that is a major problem, especially as the L1i caches on modern
processors are already so tiny.
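You can check whether this is what's happening with perf; event names
differ a bit per kernel and CPU, so take these as an example rather than
gospel (prog32/prog64 are placeholder names for the two builds):

  perf stat -e stalled-cycles-frontend,L1-icache-load-misses ./prog32
  perf stat -e stalled-cycles-frontend,L1-icache-load-misses ./prog64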
d) gcc is horrible at optimizing branches. Where a compiler like
Intel C++ easily gets 20%-25% more performance out of PGO (profile-guided
optimization), gcc gets peanuts out of the PGO phase for my
chess program: 3% or so.
This all has to do with how it deals with branches and the poor
optimizations that get triggered.
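For reference, the gcc PGO sequence itself is simple enough; prog.c and
the training input are placeholders here:

  gcc -O2 -fprofile-generate -o prog prog.c
  ./prog < training-input      # run a representative workload
  gcc -O2 -fprofile-use -o prog prog.c

How much that final -fprofile-use build actually gains is, as said,
another matter.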
From these horrors you could either benefit by going to 64-bit,
because the offending code is no longer generated there, or
you could pick up an additional penalty from the move. The latter happens,
for example, on an older-generation AMD processor when a jump suddenly
lands outside what the processor sees in its lookahead window, suddenly
causing a huge branch-mispredict penalty.
It largely depends upon the processor you have; especially older AMD
processors suffer there.
So moving to 64-bit could occasionally speed you up, even without
using any optimization at all, just because some of the old FUBAR code
generated for 32-bit no longer gets triggered.
Are you asking why the same program runs faster when compiled as 64-bit
rather than 32-bit? There are /many/ reasons why 64-bit x86 code can be
faster than 32-bit x86 code - without having any idea about your code,
we can only make general points. In comparison to 32-bit x86, the
64-bit mode has access to more registers,
Usually processors are optimized for code that reuses just a few registers,
and they already use all sorts of tricks internally (where additional
physical registers get used) to make up for it, so the extra architectural
registers are hardly an advantage of any kind in 64-bit mode, not even in
algorithmic code.
In tests with assembler code that deliberately uses more registers, the
processors actually slow down. So there is a performance benefit in
reusing the same few registers over and over again.
This penalty for using more registers is not only there in x64;
it was already the case on x86 processors. In fact it's easy to measure
on the Pentium from two decades ago.
Hopefully that'll change a tad in the future - yet I consider that
unlikely, as it would also involve changes in the Intel C++ compiler.
has wider registers (which speeds data movement), less complicated
instruction decoding and instruction prefixes, more efficient floating
point, and much more efficient calling conventions. It has the
disadvantage that pointers take up twice as much data cache and memory
bandwidth, as they are twice the size.
Exactly, the wider registers matter: if you use 64-bit datatypes like
"long long", then obviously 64-bit mode is a huge advantage over 32-bit.
This can easily give a factor-2 speed improvement for integer code that
really is 64-bit.
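A tiny sketch of the sort of code I mean (illustrative only, not from
Diep): summing 64-bit values. In 32-bit mode every addition must be split
into a pair of instructions over two halves; in 64-bit mode it is a
single add.

  unsigned long long sum64(const unsigned long long *a, int n)
  {
      unsigned long long s = 0;
      int i;
      /* one 64-bit add per iteration in 64-bit mode;
         two 32-bit instructions (add + adc) per iteration in 32-bit mode */
      for (i = 0; i < n; i++)
          s += a[i];
      return s;
  }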
Seen from a distance you're totally correct here that caches are the
problem. To zoom in: the larger pointer is more of a problem for the
instruction side of the cache.
In itself a larger pointer doesn't mean that the size the data occupies
in the data cache grows.
Yet in 64-bit mode the compiler needs more instructions to get at the
32-bit data, and such 64-bit pointer instructions are simply larger,
putting more stress on instruction decoding/transport, whereas we already
know the processor can only decode a limited number of bytes per clock.
Now for a lot of programs this isn't a big problem, as another bitwise AND
is a very fast instruction, yet for my software, which is pretty optimized,
I feel every additional instruction, not least because it makes the
already super-tiny L1 instruction cache choke even more, and because the
IPC is already very high :)
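An easy way to see the code-size growth for yourself is to build the same
file both ways and compare the text segment; diep.c is just a placeholder
here, and size is the usual binutils tool:

  gcc -O2 -m32 -c diep.c && size diep.o
  gcc -O2 -m64 -c diep.c && size diep.o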
As for gcc options to "fix" it, there is no problem to fix - it is
normal that 64-bit code is a bit more efficient than 32-bit code from
the same program, but details vary according to the code in question.
One thing I notice from your post is that you are compiling without
enabling optimisation, which cripples the compiler's performance.
Enabling "-O2" will probably make your code several times faster (again,
without information on the program, I can only make general statements).
Different optimisation settings like "-Os", "-O3", and individual
optimisation flags may or may not make the code faster, but "-O2" is a
good start.
A good tip with gcc is to never go further than -O2;
going further is at your own risk :)
In the past 20 years or so, gcc has actually never generated faster code
for my chess software with -O3; usually it causes problems and slows
things down.
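Rather than take anyone's word for it, it's easy to measure both on your
own program (prog.c again a placeholder):

  gcc -O2 -o prog_o2 prog.c
  gcc -O3 -o prog_o3 prog.c
  perf stat ./prog_o2
  perf stat ./prog_o3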
Kind Regards,
Vincent