Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?

On Wed, 26 Mar 2014, Florian Weimer wrote:

> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:

>> a) for example if you use signed 32 bits indexation, for example
>>
>> int i, array[64];
>>
>> i = ...;
>> x = array[i];
>>
>> this goes very fast in 32 bits processor and 32 bits mode yet a lot
>> slower in 64 bits mode, as i needs a sign extension to 64 bits.
>> So the compiler generates 1 additional instruction in 64 bits mode
>> to sign extend i from 32 bits to 64 bits.

> Is this relevant in practice? I'm asking because it's a missed optimization opportunity: negative subscripts lead to undefined behavior here, so the sign extension can be omitted.

Yes, this is very relevant of course, as it is an instruction. It all adds up, you know. Now I don't know whether some modern processors can secretly fuse this internally - as about 99.9% of all C and C++ source code in existence just uses 'int' of course.

The C specification in fact describes 'int' as the natural, supposedly fastest, integer type for the target.

Well, on x64 it is not. It's a lot slower if you use it to index: a factor 2 in instructions for the lookup to be precise, as it generates one additional instruction.
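A minimal sketch of the difference (the assembly in the comments is what gcc typically emits on x86-64 at -O2; exact output varies with version and flags):

    #include <stddef.h>

    int array[64];

    /* signed 32-bit index: typically sign-extended first, e.g.
     *   movslq %edi, %rax
     *   movl   array(,%rax,4), %eax                        */
    int lookup_int(int i)
    {
        return array[i];
    }

    /* size_t index: already 64 bits wide in the register, e.g.
     *   movl   array(,%rdi,4), %eax                        */
    int lookup_size_t(size_t i)
    {
        return array[i];
    }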

If I write normal code, I simply use "int" and standardize upon that.

Writing for speed has not been made easier, because "int" is still a 32-bit datatype whereas we have 64-bit processors nowadays.

The problem would be solved if 'sizeof(int)' were suddenly 8 bytes, of course.

That would mean big refactoring of a lot of code though, yet one day we will need to go through that process :)

I tend to remember that back in the day, sizeof(long) on DEC Alpha was already 8 bytes.

Now I'm not suggesting, not even indicating, that this would be a wise change.
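For reference, a tiny program to check which data model you are on (LP64 Linux/x64 prints 4/8/8, ILP32 prints 4/4/4, Win64's LLP64 prints 4/4/8):

    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)    = %zu\n", sizeof(int));
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        return 0;
    }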

>> b) some processors can 'issue' more 32 bits instructions a clock than 64
>> bits instructions.

> Some earlier processors also support more µop optimization in 32 bit mode.

I'm not a big expert on how the decode and dispatch stages of processors work nowadays - it has all become very complex.

Yet decoding and delivering instructions is the bottleneck on today's processors. They all have plenty of execution units; they just cannot decode and deliver enough bytes per clock.

>> My chessprogram Diep which is deterministic integer code (so no vector
>> codes) compiled 32 bits versus 64 bits is about 10%-12% slower in 64
>> bits than in 32 bits. This where it does use a few 64 bits datatypes
>> (very little though). In 64 bits the datasize used doesn't grow,
>> instruction wise it grows immense of course.

> Well, chess programs used to be the prototypical example for 64 bit architectures ...

Only when a bunch of CIA-related organisations got involved in funding a bunch of programs - as it's easier to copy source code if you write it for a sneaky organisation anyway.

The top chess engines are, from their origin, all 32-bit based, as they can execute 32-bit instructions faster of course, and most mobile phones are still 32-bit anyway.

You cannot just cut and paste source code from others and get away with it in a commercial setting.

Commercially that's too expensive because of all the court cases - and you can bet they will be there. Just when governments got involved for the first time in history, I saw a bunch of guys work together who otherwise would poke each other's eyes out at any given occasion :)

I made another chess program here a while ago which gets nearly 10 million nps on a single core. No 64-bit engine will ever manage that :)

Those extra instructions you can execute are deadly. And we're NOT speaking about vector instructions here - just integers.

The reason why 64-bit is interesting is not that it is any faster - it is not; it's slower in terms of executing instructions. Yet algorithmically you can use a huge hashtable shared by all cores together, and that speeds you up big time.

More than a decade ago I was happy to use 200 GB there on the SGI supercomputer. It really helps... not as much as some would guess, yet a factor 2 really is a lot :)
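A minimal sketch of that idea - the names (tt_init, tt_probe) are just illustrative, not from any real engine. In 64-bit mode the table below can be many gigabytes, which a 32-bit address space simply cannot hold:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t key;       /* Zobrist hash of the position */
        int32_t  score;
        int32_t  depth;
    } tt_entry;

    static tt_entry *tt;
    static uint64_t  tt_mask;

    /* allocate a hashtable of roughly 'bytes' bytes,
     * e.g. 8ULL << 30 for 8 GB - impossible with 32-bit pointers */
    int tt_init(uint64_t bytes)
    {
        uint64_t n = bytes / sizeof(tt_entry);
        if (n == 0)
            return 0;
        while (n & (n - 1))     /* round down to a power of two */
            n &= n - 1;
        tt = calloc(n, sizeof(tt_entry));
        tt_mask = n - 1;
        return tt != NULL;
    }

    tt_entry *tt_probe(uint64_t key)
    {
        return &tt[key & tt_mask];  /* one masked lookup, no modulo */
    }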

>> Besides the above reasons another reason why 32 bits programs compiled
>> 64 bits can be a lot slower in case of Diep is:
>>
>> c) the larger code size causes more L1 instruction cache misses.

> This really depends on the code. Not everything is larger. Typically it's the increased pointer size that causes increased data cache misses, which then causes slowdowns.
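A small illustration of that pointer-size effect (a hypothetical list node, not taken from Diep):

    #include <stdio.h>

    struct node {
        struct node *next;  /* 4 bytes on ILP32, 8 bytes on LP64 */
        int value;
    };

    int main(void)
    {
        /* typically 8 bytes on a 32-bit target but 16 on a 64-bit
         * target (8-byte pointer + 4-byte int + 4 bytes padding),
         * so only half as many nodes fit in each cache line */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }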

Really a lot changes in 64-bit of course, as the above chess software is mainly busy with array lookups and branches in between them.

You need those lookups everywhere. Arrays are really important, not only because you want to look something up, but also because they avoid writing out another bunch of lines of code to get to the same result :)

Also the index into the array needs to be 64 bits of course, which means that in the end every value gets converted to 64 bits in 64-bit mode - which makes sense.

Now I'm sure you define all array lookups as lookups through a pointer, so we're on the same page then :)

Please also note that lots of branches in chess programs suddenly tend to get slower as well. Some might in fact go from, say, around a 5 clock penalty to a 30 clock penalty, because the distance in bytes between the conditional jump and the spot it might jump to is larger.

That you really feel big time.

GCC has always been world champion in rewriting branches into something slower than the straightforward manner - and even the PGO phase couldn't improve upon that. It especially slowed things down most on AMD.

I tend to remember a discussion between a GCC guy and Linus there, where Linus said there was no excuse not to generate CMOVs now and then on modern processors like Core 2 and Opteron - whereas the GCC team member (a Polish name I didn't recognize) argued that crippling GCC was needed because he owned a P4 :)

That was not long after I posted some similar code in forums showing how FUBAR gcc was with branches - yet 'by accident' that code got a 25-30 clock penalty on AMD and not on Intel.

That piece of code goes better nowadays.
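For the record, the kind of code involved looks like this; whether gcc emits a conditional jump or a cmov here depends on the target and its cost model (the function name is just for illustration):

    /* on most x86-64 targets this can compile branchless:
     *   cmpl %esi, %edi ; cmovl %esi, %edi ; ...          */
    int max_int(int a, int b)
    {
        return a > b ? a : b;
    }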

Where GCC needs major improvements right now is in the PGO phase. The difference is just abnormal: something like a 3% speedup using PGO in GCC versus a 20-25% speedup with other compilers, among which Intel C++.

I do not know what causes it - yet there should be tons of source code out there with the same problem.
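For reference, the GCC PGO sequence being compared is the usual two-step build (program name and training input are placeholders):

    gcc -O2 -fprofile-generate -o diep *.c
    ./diep < training_input            # writes .gcda profile data
    gcc -O2 -fprofile-use -o diep *.c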

> --
> Florian Weimer / Red Hat Product Security Team
