Re: Why is the performance of 32bit program worse than 64bit program running on the same 64bit system, They are compiled from same source. Which gcc option can fix it?

On Wed, 26 Mar 2014, Florian Weimer wrote:

> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:

>> a) for example if you use signed 32 bits indexation, for example
>>
>> int i, array[64];
>>
>> i = ...;
>> x = array[i];
>>
>> this goes very fast in 32 bits processor and 32 bits mode yet a lot
>> slower in 64 bits mode, as i needs a sign extension to 64 bits.
>> So the compiler generates 1 additional instruction in 64 bits mode
>> to sign extend i from 32 bits to 64 bits.

> Is this relevant in practice? I'm asking because it's a missed optimization opportunity: negative subscripts lead to undefined behavior here, so the sign extension can be omitted.

Yes, this is very relevant of course, as it is an instruction. It all adds up, you know. Now I don't know whether some modern processors can secretly fuse this internally - as about 99.9% of all C and C++ source code in existence just uses 'int' of course.

The C specification in fact describes 'int' as the natural, supposedly fastest, integer type for the target.

Well, on x64 it is not. It's a lot slower if you use it to index: a factor 2 in instructions for the lookup to be precise, as it generates one additional instruction.
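A minimal sketch of the difference (the assembly in the comments is what gcc typically emits on x86-64 at -O2; exact output varies with version and flags):

    #include <stddef.h>

    int array[64];

    /* signed 32-bit index: typically sign-extended first, e.g.
     *   movslq %edi, %rax
     *   movl   array(,%rax,4), %eax                        */
    int lookup_int(int i)
    {
        return array[i];
    }

    /* size_t index: already 64 bits wide in the register, e.g.
     *   movl   array(,%rdi,4), %eax                        */
    int lookup_size_t(size_t i)
    {
        return array[i];
    }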

If I write normal code, I simply use "int" and standardize upon that.

Writing for speed has not been made easier, because "int" is still a 32-bit datatype whereas we have 64-bit processors nowadays.

The problem would be solved if 'sizeof(int)' were suddenly 8 bytes, of course.

That would mean big refactoring of a lot of code though, yet one day we will need to go through that process :)

I tend to remember that back in the day, sizeof(long) on DEC Alpha was already 8 bytes.

Now I'm not suggesting, not even indicating, that this would be a wise change.
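For reference, a tiny program to check which data model you are on (LP64 Linux/x64 prints 4/8/8, ILP32 prints 4/4/4, Win64's LLP64 prints 4/4/8):

    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)    = %zu\n", sizeof(int));
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        return 0;
    }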

>> b) some processors can 'issue' more 32 bits instructions a clock than 64
>> bits instructions.

> Some earlier processors also support more µop optimization in 32 bit mode.

I'm not a big expert on how the decode and dispatch stages of processors work nowadays - it has all become very complex.

Yet decoding and delivering instructions is the bottleneck on today's processors. They all have plenty of execution units; they just cannot decode and deliver enough bytes per clock.

>> My chessprogram Diep which is deterministic integer code (so no vector
>> codes) compiled 32 bits versus 64 bits is about 10%-12% slower in 64
>> bits than in 32 bits. This where it does use a few 64 bits datatypes
>> (very little though). In 64 bits the datasize used doesn't grow,
>> instruction wise it grows immense of course.

> Well, chess programs used to be the prototypical example for 64 bit architectures ...

Only when a bunch of CIA-related organisations got involved in funding a bunch of programs - as it's easier to copy source code if you write it for a sneaky organisation anyway.

The top chess engines are, from their origin, all 32-bit based, as they can execute 32-bit instructions faster of course, and most mobile phones are still 32-bit anyway.

You cannot just cut and paste source code from others and get away with it in a commercial setting.

Commercially that's too expensive because of all the court cases - and you can bet they will be there. Just when governments got involved for the first time in history, I saw a bunch of guys work together who otherwise would poke each other's eyes out at any given occasion :)

I made another chess program here a while ago which gets nearly 10 million nps on a single core. No 64-bit engine will ever manage that :)

Those extra instructions you can execute are deadly. And we're NOT speaking about vector instructions here - just integers.

The reason why 64-bit is interesting is not that it is any faster - it is not; it's slower in terms of executing instructions. Yet algorithmically you can use a huge hashtable shared by all cores together, and that speeds you up big time.

More than a decade ago I was happy to use 200 GB there on the SGI supercomputer. It really helps... not as much as some would guess, yet a factor 2 really is a lot :)
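A minimal sketch of that idea - the names (tt_init, tt_probe) are just illustrative, not from any real engine. In 64-bit mode the table below can be many gigabytes, which a 32-bit address space simply cannot hold:

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint64_t key;       /* Zobrist hash of the position */
        int32_t  score;
        int32_t  depth;
    } tt_entry;

    static tt_entry *tt;
    static uint64_t  tt_mask;

    /* allocate a hashtable of roughly 'bytes' bytes,
     * e.g. 8ULL << 30 for 8 GB - impossible with 32-bit pointers */
    int tt_init(uint64_t bytes)
    {
        uint64_t n = bytes / sizeof(tt_entry);
        if (n == 0)
            return 0;
        while (n & (n - 1))     /* round down to a power of two */
            n &= n - 1;
        tt = calloc(n, sizeof(tt_entry));
        tt_mask = n - 1;
        return tt != NULL;
    }

    tt_entry *tt_probe(uint64_t key)
    {
        return &tt[key & tt_mask];  /* one masked lookup, no modulo */
    }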

>> Besides the above reasons another reason why 32 bits programs compiled
>> 64 bits can be a lot slower in case of Diep is:
>>
>> c) the larger code size causes more L1 instruction cache misses.

> This really depends on the code. Not everything is larger. Typically it's the increased pointer size that causes increased data cache misses, which then causes slowdowns.
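A small illustration of that pointer-size effect (a hypothetical list node, not taken from Diep):

    #include <stdio.h>

    struct node {
        struct node *next;  /* 4 bytes on ILP32, 8 bytes on LP64 */
        int value;
    };

    int main(void)
    {
        /* typically 8 bytes on a 32-bit target but 16 on a 64-bit
         * target (8-byte pointer + 4-byte int + 4 bytes padding),
         * so only half as many nodes fit in each cache line */
        printf("sizeof(struct node) = %zu\n", sizeof(struct node));
        return 0;
    }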

Really a lot changes in 64-bit of course, as the above chess software is mainly busy with array lookups and branches in between them.

You need those lookups everywhere. Arrays are really important, not only because you want to look something up, but also because they avoid writing out another bunch of lines of code to get to the same result :)

Also the index into the array needs to be 64 bits of course, which means that in the end every value gets converted to 64 bits in 64-bit mode - which makes sense.

Now I'm sure you define all array lookups as lookups through a pointer, so we're on the same page then :)

Please also note that lots of branches in chess programs suddenly tend to get slower as well. Some might in fact go from, say, around a 5 clock penalty to a 30 clock penalty, because the distance in bytes between the conditional jump and the spot it might jump to is larger.

That you really feel big time.

GCC has always been world champion in rewriting branches into something slower than the straightforward manner - and even the PGO phase couldn't improve upon that. It especially slowed things down most on AMD.

I tend to remember a discussion between a GCC guy and Linus there, where Linus said there was no excuse not to generate CMOVs now and then on modern processors like Core 2 and Opteron - whereas the GCC team member (a Polish name I didn't recognize) argued that crippling GCC was needed because he owned a P4 :)

That was not long after I posted some similar code in forums showing how FUBAR gcc was with branches - yet 'by accident' that code got a 25-30 clock penalty on AMD and not on Intel.

That piece of code goes better nowadays.
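For the record, the kind of code involved looks like this; whether gcc emits a conditional jump or a cmov here depends on the target and its cost model (the function name is just for illustration):

    /* on most x86-64 targets this can compile branchless:
     *   cmpl %esi, %edi ; cmovl %esi, %edi ; ...          */
    int max_int(int a, int b)
    {
        return a > b ? a : b;
    }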

Where GCC needs major improvements right now is in the PGO phase. The difference is just abnormal: something like a 3% speedup using PGO in GCC versus a 20-25% speedup with other compilers, among which Intel C++.

I do not know what causes it - yet there should be tons of source code out there with the same problem.
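For reference, the GCC PGO sequence being compared is the usual two-step build (program name and training input are placeholders):

    gcc -O2 -fprofile-generate -o diep *.c
    ./diep < training_input            # writes .gcda profile data
    gcc -O2 -fprofile-use -o diep *.c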

> --
> Florian Weimer / Red Hat Product Security Team
