Re: Why is the performance of a 32-bit program worse than that of a 64-bit program running on the same 64-bit system? They are compiled from the same source. Which gcc option can fix it?

Hi Guys:
     Thanks for your reply. I am sorry I missed the test case earlier.
The attachment is the test case.

Best regards


2014-03-26 22:07 GMT+08:00 Vincent Diepeveen <diep@xxxxxxxxx>:
>
>
> On Wed, 26 Mar 2014, Florian Weimer wrote:
>
>> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:
>>
>>> a) for example if you use signed 32 bits indexation, for example
>>>
>>> int i, x, array[64];
>>>
>>> i = ...;
>>> x = array[i];
>>>
>>> this is very fast on a 32-bit processor in 32-bit mode, yet a lot
>>> slower in 64-bit mode, as i needs a sign extension to 64 bits.
>>> So the compiler generates one additional instruction in 64-bit mode
>>> to sign-extend i from 32 bits to 64 bits.
>>
>>
>> Is this relevant in practice?  I'm asking because it's a missed
>> optimization opportunity--negative subscripts lead to undefined behavior
>> here, so the sign extension can be omitted.
>
>
> Yes, this is very relevant of course, as it is an extra instruction.
> It all adds up, you know. Now I don't know whether some modern processors
> can secretly fuse this internally - as about 99.9% of all C and C++ source
> code in existence just uses 'int', of course.
>
> In the C specification, 'int' is in fact meant to be the natural, fastest
> datatype.
>
> Well, on x64 it is not. It's a lot slower if you use it to index. A factor
> of 2 slower, to be precise, if you use it to index, as it generates another
> instruction.
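As a minimal sketch of the indexing point being made here (function names are made up for illustration): with a signed 32-bit index, a 64-bit compiler has to sign-extend the index to pointer width (movsxd on x86-64) before the address calculation, unless it can prove the index never goes negative; a size_t index is already pointer-wide.

```c
#include <stddef.h>

/* Signed 32-bit index: in 64-bit mode the compiler must sign-extend i
 * to 64 bits before forming the address for array[i], unless it can
 * prove i is non-negative. */
long sum_int_index(const long *array, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += array[i];
    return total;
}

/* Same loop with size_t: the index is already pointer-sized, so no
 * widening instruction is needed. */
long sum_size_index(const long *array, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += array[i];
    return total;
}
```

Note that in simple counted loops like these, modern compilers can usually hoist or eliminate the sign extension; the extra instruction shows up most clearly when the index comes from elsewhere (a table, a function argument) rather than a loop counter.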
>
> If I write normal code, I simply use "int" and standardize on that.
>
> Writing for speed has not been made easier, because "int" is still a 32-bit
> datatype whereas we have 64-bit processors nowadays.
>
> The problem would be solved if 'sizeof(int)' suddenly were 8 bytes, of course.
>
> That would mean big refactoring of a lot of code, though, yet one day we
> will need to go through that process :)
>
> I seem to remember that back in the day, sizeof(long) on the DEC Alpha was
> already 8 bytes.
>
> Now I'm not suggesting, not even hinting, that this would be a wise change.
>
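For reference, a tiny probe of the type sizes under discussion. On a typical LP64 Linux/x86-64 target, int is 4 bytes while long and pointers are 8; under gcc -m32 (ILP32) all three are 4 bytes, so int stays 32-bit either way.

```c
#include <stddef.h>

/* Report the sizes of the types discussed above.
 * LP64 (gcc -m64 on Linux/x86-64): int = 4, long = 8, void* = 8.
 * ILP32 (gcc -m32):                int = 4, long = 4, void* = 4. */
size_t int_size(void)  { return sizeof(int); }
size_t long_size(void) { return sizeof(long); }
size_t ptr_size(void)  { return sizeof(void *); }
```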
>
>>> b) some processors can 'issue' more 32-bit instructions per clock than
>>> 64-bit instructions.
>>
>>
>> Some earlier processors also support more µop optimization in 32 bit mode.
>
>
> I'm not a big expert on how the decode and dispatch stages of processors
> work nowadays - it has all become very complex.
>
> Yet decoding and delivering instructions is the bottleneck in today's
> processors. They all have plenty of execution units.
>
> They just cannot decode and deliver enough bytes per clock.
>
>
>>> My chess program Diep, which is deterministic integer code (so no vector
>>> code), is about 10%-12% slower compiled 64-bit than compiled 32-bit.
>>> This even though it does use a few 64-bit datatypes (very few, though).
>>> In 64-bit mode the data size used doesn't grow; instruction-wise it
>>> grows immensely, of course.
>>
>>
>> Well, chess programs used to be the prototypical example for 64 bit
>> architectures ...
>
>
> Only when a bunch of CIA-related organisations got involved in funding a
> bunch of programs - as it's easier then to copy source code if you
> write it for a sneaky organisation anyway.
>
> The top chess engines were originally all 32-bit based, as they can execute
> 32-bit instructions faster, of course, and most mobile phones are still
> 32-bit anyway.
>
> You cannot just cut and paste source code from others and get away with it
> in a commercial setting.
>
> Commercially, it's too expensive to cut and paste other people's work
> because of all the court cases - and you bet they will happen. Just when
> governments got involved for the first time in history, I saw a bunch of
> guys work together who otherwise would poke out each other's eyes at any
> given occasion :)
>
> I made another chess program here a while ago which gets nearly 10 million
> nps on a single core. No 64-bit engine will ever manage that :)
>
> Those extra instructions you can execute are deadly. And we're NOT speaking
> about vector instructions here - just integers.
>
> The reason why 64-bit is interesting is not because it is any faster - it
> is not. It's slower in terms of executing instructions.
>
> Yet algorithmically you can use a huge hash table shared by all cores, and
> that speeds you up big time.
>
> More than a decade ago I was happy to use 200 GB there on the SGI
> supercomputer. It really helps... ...not as much as some would guess,
> yet a factor of 2 really is a lot :)
>
>
>>> Besides the above reasons, another reason why 32-bit programs compiled
>>> as 64-bit can be a lot slower, in the case of Diep, is:
>>>
>>> c) the larger code size causes more L1 instruction cache misses.
>>
>>
>> This really depends on the code.  Not everything is larger.  Typically
>> it's the increased pointer size that causes increased data cache misses,
>> which then cause slowdowns.
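A small sketch of that pointer-size effect, using a hypothetical pointer-heavy node: on ILP32 it occupies 12 bytes, on LP64 it grows to 24 bytes including padding, so only half as many nodes fit in each cache line.

```c
#include <stddef.h>

/* A node like those in a linked structure.
 * ILP32: 4 + 4 + 4 = 12 bytes.
 * LP64:  8 + 8 + 4, padded to 24 bytes for 8-byte pointer alignment. */
struct node {
    struct node *next;
    struct node *prev;
    int value;
};

size_t node_size(void)
{
    return sizeof(struct node);
}
```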
>
>
> Really a lot changes in 64-bit mode of course, as
> the above chess software is mainly busy with array lookups and branches in
> between them.
>
> You need those lookups everywhere. Arrays are really important. Not only
> because you want to look something up, but also because they avoid writing
> out another bunch of lines of code to achieve the same :)
>
> Also, the index into the array needs to be 64 bits, of course. Which means
> that in the end every value gets converted to 64 bits in 64-bit mode, which
> makes sense.
>
> Now I'm sure you define all array lookups as lookups through a pointer, so
> we're on the same page then :)
>
> Please also note that suddenly lots of branches in chess programs also tend
> to get slower. Some in fact might go from, say, around a 5-clock penalty to
> a 30-clock penalty, because the distance in bytes between the conditional
> jump and the spot it might jump to is larger.
>
> That you really feel, big time.
>
> GCC has always been world champion at rewriting branches into something
> that is slower up front than the straightforward manner - and even the PGO
> phase couldn't improve upon that. It especially slowed things down most
> on AMD.
>
> I seem to remember a discussion between a GCC guy and Linus, where Linus
> said there was no excuse not to generate CMOVs now and then on modern
> processors like the Core 2 and Opteron - whereas the GCC team member (a
> Polish name I didn't recognize) argued that crippling GCC was needed
> because he owned a P4 :)
>
> That was not long after I posted some similar code in forums showing how
> FUBAR gcc was with branches - yet "by accident" that got a 25-30 clock
> penalty on AMD and not on Intel.
>
> That piece of code does better nowadays.
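For what it's worth, the kind of code in question is a simple conditional like the one below, which gcc at -O2 can lower to a cmov instead of a conditional branch (a generic sketch, not the exact code from that discussion).

```c
/* A plain ternary is exactly the pattern compilers can compile to a
 * conditional move (cmovg on x86-64) at -O2, replacing a potentially
 * unpredictable branch with a branch-free instruction. */
int max_of(int a, int b)
{
    return a > b ? a : b;
}

int min_of(int a, int b)
{
    return a < b ? a : b;
}
```

Whether the compiler actually emits cmov or a branch depends on the target and cost model; the point of the old argument was that on Core 2 and Opteron the cmov form is usually the safe choice.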
>
> Where GCC needs major improvements right now is in the PGO phase.
> The difference is just abnormal: something like a 3% speedup using PGO
> with GCC versus a 20-25% speedup with other compilers, among them
> Intel C++.
>
> I do not know what causes it - yet there should be tons of source code
> available that exhibits the same problem.
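For completeness, the two-step PGO build being discussed looks like this with GCC. The flags are real gcc options; `prog.c` and the workload input are placeholders.

```shell
# Step 1: build an instrumented binary and run a representative workload.
gcc -O2 -fprofile-generate prog.c -o prog
./prog typical-input        # writes *.gcda profile data next to the objects

# Step 2: rebuild using the collected profile.
gcc -O2 -fprofile-use prog.c -o prog
```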
>
>
>
>
>
>
>> --
>> Florian Weimer / Red Hat Product Security Team
>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
        if (argc < 2) {
                fprintf(stderr, "usage: %s <count>\n", argv[0]);
                return 1;
        }

        int count = atoi(argv[1]);
        int j = 0;
        FILE *fp;
        char msg[] = {'x', 'y', 'z'};   /* note: not NUL-terminated */
        char buf[20];

        if ((fp = fopen("/dev/zero", "wb+")) == NULL) {
                printf("Cannot open file, strike any key to exit!");
                exit(1);
        }

        fwrite(msg, sizeof(msg), 1, fp);
        rewind(fp);     /* a positioning call is required between a write
                           and a read on an update stream */

        /* msg has no terminating NUL, so use sizeof(msg) rather than
         * strlen(msg), which was undefined behaviour. */
        while ((count > 0 && j < count) || count == 0) {
                fread(buf, sizeof(msg), 1, fp);
                j++;
                if (j > count)
                        break;
        }

        fclose(fp);

        return 0;
}
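The test case above can be built both ways for comparison, roughly as follows (assuming a Linux/x86-64 box with the 32-bit multilib installed; `test.c` is a stand-in filename):

```shell
# Build the same source as 64-bit and as 32-bit (the latter needs gcc-multilib).
gcc -O2 -m64 test.c -o test64
gcc -O2 -m32 test.c -o test32

# Compare wall-clock time for an identical iteration count.
time ./test64 10000000
time ./test32 10000000
```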

