Hi guys,

Thanks for your reply. I am sorry I missed the test case; the attachment is
the test case. A couple of small illustrative sketches also follow the
quoted thread below.

Best regards

2014-03-26 22:07 GMT+08:00 Vincent Diepeveen <diep@xxxxxxxxx>:
>
> On Wed, 26 Mar 2014, Florian Weimer wrote:
>
>> On 03/25/2014 04:51 PM, Vincent Diepeveen wrote:
>>
>>> a) for example if you use signed 32 bits indexation, for example
>>>
>>> int i, array[64];
>>>
>>> i = ...;
>>> x = array[i];
>>>
>>> this goes very fast in 32 bits processor and 32 bits mode yet a lot
>>> slower in 64 bits mode, as i needs a sign extension to 64 bits.
>>> So the compiler generates 1 additional instruction in 64 bits mode
>>> to sign extend i from 32 bits to 64 bits.
>>
>> Is this relevant in practice? I'm asking because it's a missed
>> optimization opportunity--negative subscripts lead to undefined behavior
>> here, so the sign extension can be omitted.
>
> Yes, this is very relevant of course, as it is an instruction.
> It all adds up, you know. Now I don't know whether some modern processors
> can secretly fuse this internally - as about 99.9% of all C and C++
> source codes in existence just use 'int' of course.
>
> In the C specification, in fact, 'int' gets defined as the fastest
> possible datatype.
>
> Well, at x64 it is not. It's a lot slower if you use it to index. Factor
> 2 slower to be precise, if you use it to index, as it generates another
> instruction.
>
> If I write normal code, I simply use "int" and standardize upon that.
>
> Writing for speed has not been made easier, because "int" still is a
> 32 bits datatype whereas we have 64 bits processors nowadays.
>
> The problem would be solved when 'sizeof(int)' suddenly is 8 bytes of
> course.
>
> That would mean big refactoring of lots of codes though, yet one day we
> will need to go through that process :)
>
> I tend to remember that back in the days, sizeof(long) at DEC Alpha was
> 8 bytes already.
>
> Now I'm not suggesting, not even indicating, this would be a wise change.
>
>>> b) some processors can 'issue' more 32 bits instructions a clock than
>>> 64 bits instructions.
>>
>> Some earlier processors also support more µop optimization in 32 bit
>> mode.
>
> I'm not a big expert on how the decoding and transport phase of
> processors works nowadays - it all has become so very complex.
>
> Yet the decoding and delivery of the instructions is the bottleneck at
> today's processors. They all have plenty of execution units.
>
> They just cannot decode and deliver enough bytes per clock.
>
>>> My chess program Diep, which is deterministic integer code (so no
>>> vector code), compiled 32 bits versus 64 bits is about 10%-12% slower
>>> in 64 bits than in 32 bits. This where it does use a few 64 bits
>>> datatypes (very little though). In 64 bits the data size used doesn't
>>> grow; instruction wise it grows immensely of course.
>>
>> Well, chess programs used to be the prototypical example for 64 bit
>> architectures ...
>
> Only when a bunch of CIA related organisations got involved in funding a
> bunch of programs - as it's easier then to copy source code if you
> write it for a sneaky organisation anyway.
>
> The top chess engines are, from origin, all 32 bits based, as they can
> execute 32 bits instructions faster of course, and most mobile phones
> still are 32 bits anyway.
>
> You cannot just cut and paste source code from others and get away with
> it in a commercial setting.
> Commercially speaking, it's too expensive to cut and paste other
> people's work because of all the court cases, and you bet they will be
> there - just when governments got involved for the first time in history,
> I saw a bunch of guys work together who otherwise would stick out each
> other's eyes at any given occasion :)
>
> I made another chess program here a while ago which gets nearly 10
> million nps single core. No 64 bits engine will ever manage that :)
>
> Those extra instructions you can execute are deadly. And we're NOT
> speaking about vector instructions here - just integers.
>
> The reason why 64 bits is interesting is not because it is any faster -
> it is not. It's slower in terms of executing instructions.
>
> Yet algorithmically you can use a huge hashtable with all cores together,
> so that speeds you up bigtime then.
>
> More than a decade ago I was happy to use 200 GB there at the SGI
> supercomputer. It really helps... not as much as some would guess it
> helps, yet a factor 2 really is a lot :)
>
>>> Besides the above reasons, another reason why 32 bits programs compiled
>>> 64 bits can be a lot slower in case of Diep is:
>>>
>>> c) the larger code size causes more L1 instruction cache misses.
>>
>> This really depends on the code. Not everything is larger. Typically
>> it's the increased pointer size that causes increased data cache misses,
>> which then causes slowdowns.
>
> Really a lot changes to 64 bits of course, as the above chess software is
> mainly busy with array lookups and branches in between them.
>
> You need those lookups everywhere. Arrays are really important. Not only
> because you want to look up something, but also because they avoid
> writing out another bunch of lines of code to get to the same result :)
>
> Also the index into the array needs to be 64 bits of course. Which means
> that in the end every value gets converted to 64 bits in 64 bits mode,
> which makes sense.
>
> Now I'm sure you define all array lookups as lookups into a pointer, so
> we're on the same page then :)
>
> Please also note that suddenly lots of branches in chess programs also
> tend to get slower. Some in fact might go from, say, around a 5 clock
> penalty to a 30 clock penalty, because the distance in bytes between the
> conditional jump and the spot where it might jump to is more bytes away.
>
> That you really feel bigtime.
>
> GCC always has been world champion in rewriting branches to something
> that in advance is slower than the straightforward manner - and even the
> PGO phase couldn't improve upon that. Yet it especially slowed down most
> at AMD.
>
> I tend to remember a discussion between a GCC guy and Linus there, where
> Linus said there was no excuse not to now and then generate CMOVs at
> modern processors like Core2 and Opteron - where the GCC team member (a
> Polish name I didn't recognize) argued that crippling GCC was needed as
> he owned a P4 :)
>
> That was not long after I posted some similar code in forums showing how
> FUBAR gcc was with branches - yet "by accident" that got a 25-30 clock
> penalty at AMD and not at Intel.
>
> That piece of code goes better nowadays.
>
> Where GCC needs major improvements is in the PGO phase right now.
> It's just an abnormal difference: something like a 3% speedup using PGO
> in GCC versus a 20-25% speedup with other compilers, among which Intel
> C++.
>
> I do not know what causes it - yet there should be tons of source codes
> available that have the same problem.
>
>> --
>> Florian Weimer / Red Hat Product Security Team
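For what it's worth, here is a minimal sketch of the signed-index point
discussed above. It is my own illustration, not code from the thread, and
the function and parameter names are hypothetical. The assumption is an
x86-64 target where the compiler cannot prove the int index is non-negative,
in which case it typically has to emit one extra sign-extension instruction
(e.g. movslq) before the address calculation, while a size_t index is
already register-width:

#include <stddef.h>

/* Signed 32-bit index: in 64-bit mode the compiler usually sign-extends
   i to 64 bits before forming the address. */
int lookup_int(const int *array, int i)
{
    return array[i];
}

/* size_t index: already 64 bits wide on x86-64, so no extension needed. */
int lookup_size_t(const int *array, size_t i)
{
    return array[i];
}

Comparing the two with something like "gcc -O2 -S" on an x86-64 machine
should show the extra instruction in the first variant.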
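Similarly, purely as an illustration of the branch-versus-CMOV point (again
my own sketch with a hypothetical function name, not code from the thread):
a simple ternary like the one below is the kind of construct a compiler may
lower either to a conditional branch or to a branchless cmov, depending on
the target and optimization settings.

/* May compile to a conditional jump or to a cmov instruction. */
static inline int min_int(int a, int b)
{
    return (a < b) ? a : b;
}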
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    int count;
    int j = 0;
    FILE *fp;
    char msg[] = "xyz";
    char buf[20];

    if (argc < 2) {
        fprintf(stderr, "usage: %s <count>\n", argv[0]);
        exit(1);
    }
    count = atoi(argv[1]);

    if ((fp = fopen("/dev/zero", "wb+")) == NULL) {
        printf("Cannot open /dev/zero, exiting!\n");
        exit(1);
    }

    /* Write the three marker bytes, then read the same amount back
       repeatedly from /dev/zero. */
    fwrite(msg, strlen(msg), 1, fp);

    /* Read 'count' times; with count == 0 the body runs once and breaks. */
    while ((count > 0 && j < count) || (count == 0)) {
        fread(buf, strlen(msg), 1, fp);
        j++;
        if (j > count)
            break;
    }

    fclose(fp);
    return 0;
}
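In case it helps, and assuming GCC on Linux with the test case saved as,
say, testcase.c (the file name is my assumption), it can be built and run
with:

gcc -O2 -o testcase testcase.c
./testcase 100000

where the argument is the number of fread iterations to perform.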