On Tue, 25 Mar 2014, David Brown wrote:
On 25/03/14 04:31, Xinrong Fu wrote:
Hi guys:
What does the number of stalled cycles in the CPU pipeline frontend
mean? Why does a 32-bit program show more stalled frontend cycles than
a 64-bit program when both run on the same 64-bit system?
Are there any gcc options to fix it?
If the question is why a 32-bit program compiled
as 64-bit is a lot slower:
there can be several reasons, so let's name a few.
a) For example, if you use signed 32-bit indexing, as in
int i, x, array[64];
i = ...;
x = array[i];
this goes very fast on a 32-bit processor in 32-bit mode, yet a lot slower
in 64-bit mode, as i needs a sign extension to 64 bits before it can be
used as an index. So the compiler generates one additional instruction in
64-bit mode to sign-extend i from 32 to 64 bits.
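A common workaround, sketched below (this snippet is mine, purely
illustrative, and the function names are made up), is to use a
pointer-sized index type such as size_t, so there is nothing left to
sign-extend:

  #include <stddef.h>

  int array[64];

  /* signed 32-bit index: on x86-64 the compiler typically emits an
     extra sign-extension (movsxd) before the indexed load */
  int load_int_index(int i)
  {
      return array[i];
  }

  /* pointer-sized index: already 64 bits wide, no extension needed */
  int load_size_t_index(size_t i)
  {
      return array[i];
  }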
b) Some processors can 'issue' more 32-bit instructions per clock than
64-bit instructions. This can have many reasons; for example, the processor
can only decode a limited number of bytes per clock, and since 32-bit
instructions occupy less space (the 64-bit forms usually carry an extra
REX prefix byte), it might decode four 32-bit instructions per clock yet
only three 64-bit ones. Please note: I'm not taking vector instructions
into account here, just counting each instruction as one instruction,
regardless of how wide the register is that it operates on.
Agner Fog has more exact measurements of how few bytes modern
processors can actually decode per clock.
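To make the size difference concrete (these are the standard x86-64
encodings, easy to verify with any assembler):

  add eax, ebx     ; 01 D8      (2 bytes, 32-bit operands)
  add rax, rbx     ; 48 01 D8   (3 bytes: REX.W prefix + same opcode)

With a decoder that looks at, say, 16 bytes per clock, those extra prefix
bytes directly reduce how many instructions fit into one decode window.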
My chess program Diep, which is deterministic integer code (so no
vector code), is about 10%-12% slower compiled as 64-bit than as 32-bit.
This even though it does use a few 64-bit datatypes (very few, though).
In 64-bit mode the data size doesn't grow, but instruction-wise the code
grows immensely, of course.
Besides the above reasons, another reason why 32-bit programs compiled as
64-bit can be a lot slower, in the case of Diep, is:
c) the larger code size causes more L1 instruction cache misses.
And that is a major problem, especially as the L1i caches on modern
processors are already so tiny.
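You can check whether this is what's happening with perf; event names
differ a bit per kernel and CPU, so take these as an example rather than
gospel (prog32/prog64 are placeholder names for the two builds):

  perf stat -e stalled-cycles-frontend,L1-icache-load-misses ./prog32
  perf stat -e stalled-cycles-frontend,L1-icache-load-misses ./prog64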
d) gcc is horrible at optimizing branches. Where a compiler like
Intel C++ easily gets 20%-25% more performance out of PGO (profile-guided
optimization), gcc gets peanuts out of the PGO phase for my
chess program: 3% or so.
This all has to do with how it deals with branches and the poor
optimizations that get triggered.
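For reference, the gcc PGO sequence itself is simple enough; prog.c and
the training input are placeholders here:

  gcc -O2 -fprofile-generate -o prog prog.c
  ./prog < training-input      # run a representative workload
  gcc -O2 -fprofile-use -o prog prog.c

How much that final -fprofile-use build actually gains is, as said,
another matter.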
From these horrors you could either benefit by going to 64-bit,
because the offending code is no longer generated there, or
you could pick up an additional penalty from the move. The latter happens,
for example, on an older-generation AMD processor when a jump suddenly
lands outside what the processor sees in its lookahead window, suddenly
causing a huge branch-mispredict penalty.
It largely depends upon the processor you have; especially older AMD
processors suffer there.
So moving to 64-bit could occasionally speed you up, even without
using any optimization at all, just because some of the old FUBAR code
generated for 32-bit no longer gets triggered.
Are you asking why the same program runs faster when compiled as 64-bit
rather than 32-bit? There are /many/ reasons why 64-bit x86 code can be
faster than 32-bit x86 code - without having any idea about your code,
we can only make general points. In comparison to 32-bit x86, the
64-bit mode has access to more registers,
Usually processors are optimized for code that reuses just a few registers,
and they already use all sorts of tricks internally (where additional
physical registers get used) to make up for it, so the extra architectural
registers are hardly an advantage of any kind in 64-bit mode, not even in
algorithmic code.
In tests with assembler code that deliberately uses more registers, the
processors actually slow down. So there is a performance benefit in
reusing the same few registers over and over again.
This penalty for using more registers is not only there in x64;
it was already the case on x86 processors. In fact it's easy to measure
on the Pentium from two decades ago.
Hopefully that'll change a tad in the future - yet I consider that
unlikely, as it would also involve changes in the Intel C++ compiler.
has wider registers (which speeds data movement), less complicated
instruction decoding and instruction prefixes, more efficient floating
point, and much more efficient calling conventions. It has the
disadvantage that pointers take up twice as much data cache and memory
bandwidth, as they are twice the size.
Exactly, the wider registers matter: if you use 64-bit datatypes like
"long long", then obviously 64-bit mode is a huge advantage over 32-bit.
This can easily give a factor-2 speed improvement for integer code that
really is 64-bit.
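A tiny sketch of the sort of code I mean (illustrative only, not from
Diep): summing 64-bit values. In 32-bit mode every addition must be split
into a pair of instructions over two halves; in 64-bit mode it is a
single add.

  unsigned long long sum64(const unsigned long long *a, int n)
  {
      unsigned long long s = 0;
      int i;
      /* one 64-bit add per iteration in 64-bit mode;
         two 32-bit instructions (add + adc) per iteration in 32-bit mode */
      for (i = 0; i < n; i++)
          s += a[i];
      return s;
  }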
Seen from a distance you're totally correct here that caches are the
problem. To zoom in: the larger pointer is more of a problem for the
instruction side of the cache.
In itself a larger pointer doesn't mean that the size the data occupies
in the data cache grows.
Yet in 64-bit mode the compiler needs more instructions to get at the
32-bit data, and such 64-bit pointer instructions are simply larger,
putting more stress on instruction decoding/transport, whereas we already
know the processor can only decode a limited number of bytes per clock.
Now for a lot of programs this isn't a big problem, as another bitwise AND
is a very fast instruction, yet for my software, which is pretty optimized,
I feel every additional instruction, not least because it makes the
already super-tiny L1 instruction cache choke even more, and because the
IPC is already very high :)
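An easy way to see the code-size growth for yourself is to build the same
file both ways and compare the text segment; diep.c is just a placeholder
here, and size is the usual binutils tool:

  gcc -O2 -m32 -c diep.c && size diep.o
  gcc -O2 -m64 -c diep.c && size diep.o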
As for gcc options to "fix" it, there is no problem to fix - it is
normal that 64-bit code is a bit more efficient than 32-bit code from
the same program, but details vary according to the code in question.
One thing I notice from your post is that you are compiling without
enabling optimisation, which cripples the compiler's performance.
Enabling "-O2" will probably make your code several times faster (again,
without information on the program, I can only make general statements).
Different optimisation settings like "-Os", "-O3", and individual
optimisation flags may or may not make the code faster, but "-O2" is a
good start.
A good tip with gcc is to never go further than -O2;
going further is at your own risk :)
In the past 20 years or so, gcc has actually never generated faster code
for my chess software with -O3; usually it causes problems and slows
things down.
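Rather than take anyone's word for it, it's easy to measure both on your
own program (prog.c again a placeholder):

  gcc -O2 -o prog_o2 prog.c
  gcc -O3 -o prog_o3 prog.c
  perf stat ./prog_o2
  perf stat ./prog_o3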
Kind Regards,
Vincent