* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> On Tue, Mar 22, 2011 at 3:27 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
> >
> > If that situation has changed - if GCC has regressed in this area then a commit
> > changing the default IMHO gains a lot of credibility if it is backed by careful
> > measurements using perf stat --repeat or similar tools.
>
> Also, please don't back up any numbers for the "-O2 is faster than
> -Os" case with some benchmark that is hot in the caches.
>
> The thing is, many optimizations that make the code larger look really
> good if there are no cache misses, and the code is run a million times
> in a tight loop.
>
> But kernel code in particular tends to not be like that. [...]

To throw some numbers into the discussion, here's the size versus speed
comparison for 'hackbench 15' - which is more on the microbenchmark side of the
equation - but has macrobenchmark properties as well, because it runs 3000 tasks
and moves a lot of data, hence thrashes the caches constantly:

 CONFIG_CC_OPTIMIZE_FOR_SIZE=y
 ----------------------------------------
     6,757,858,145 cycles                #   2525.983 M/sec   ( +-   0.388% )
     2,949,907,036 instructions          #      0.437 IPC     ( +-   0.191% )
       595,955,367 branches              #    222.759 M/sec   ( +-   0.238% )
        31,504,981 branch-misses         #      5.286 %       ( +-   0.187% )

        0.164320722  seconds time elapsed   ( +-   0.524% )

 # CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
 ----------------------------------------
     6,061,867,073 cycles                #   2510.283 M/sec   ( +-   0.494% )
     2,510,505,732 instructions          #      0.414 IPC     ( +-   0.243% )
       493,721,089 branches              #    204.455 M/sec   ( +-   0.302% )
        38,731,708 branch-misses         #      7.845 %       ( +-   0.206% )

        0.148203574  seconds time elapsed   ( +-   0.673% )

They were perf stat --repeat 100 runs - repeated a couple of times to make sure
it's all real. I have used GCC 4.6.0, a relatively recent compiler. (64-bit x86,
typical .config, etc.)

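The IPC and branch-miss percentages in the right-hand column follow directly
from the raw counters. A minimal Python sketch of that arithmetic is below; the
dictionary keys and the derived() helper are names made up for illustration,
only the counter values come from the output above:

```python
# Raw counters copied from the two perf stat runs above.
# The keys "size"/"speed" and the field names are illustrative only.
runs = {
    "size":  {"cycles": 6_757_858_145, "instructions": 2_949_907_036,
              "branches": 595_955_367, "branch_misses": 31_504_981},
    "speed": {"cycles": 6_061_867_073, "instructions": 2_510_505_732,
              "branches": 493_721_089, "branch_misses": 38_731_708},
}


def derived(counters):
    """Return (IPC, branch-miss rate in percent) for one set of counters."""
    ipc = counters["instructions"] / counters["cycles"]
    miss_pct = 100.0 * counters["branch_misses"] / counters["branches"]
    return ipc, miss_pct


for name, counters in runs.items():
    ipc, miss_pct = derived(counters)
    print(f"{name:5s}  IPC {ipc:.3f}  branch-miss rate {miss_pct:.3f}%")
```

Both kernels run at well under half an instruction per cycle, so the workload
is stall-dominated either way; the -Os build retires more instructions at a
slightly higher IPC.
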
The text size differences:

     text    data     bss      dec filename
 -------------------------------------------------------------------------
  8809558 1790428 2719744 13319730 vmlinux.optimize_for_size
 10268082 1825292 2727936 14821310 vmlinux.optimize_for_speed

So by enabling CONFIG_CC_OPTIMIZE_FOR_SIZE=y, we get this total effect:

   -16.5% text size reduction
   +17.5% instruction count increase
   +20.7% branches executed increase
   -22.9% branch-miss reduction
   +11.5% cycle count increase
   +10.8% total runtime increase

A few observations:

 - the branch-miss reduction suggests that almost none of the new branches
   introduced by -Os generates a branch miss.

 - the cycles count increase is in line with the total runtime increase.

 - workloads where 16.5% more instruction cache footprint slows down the
   workload by more than ~11% would win from enabling
   CONFIG_CC_OPTIMIZE_FOR_SIZE=y.

Looking at these numbers i became more pessimistic about the usefulness of the
current implementation of CONFIG_CC_OPTIMIZE_FOR_SIZE=y - it would need some
*serious* icache thrashing to cause a larger than 11% slowdown, right?

I'm not sure what the best way would be to measure realistic macro workloads
where the kernel's instructions generate a lot of instruction-cache misses.
Most of the 'real' workloads tend to be hard to measure precisely, tend to be
very noisy and take a long time to run.

I could perhaps try to simulate them: i could patch a debug-only 'icache
flusher' function into every system call, and compare the perf stat results -
would that be an acceptable simulation of cache-cold kernel execution?

The 'icache flusher' would be something simple, like 10,000x 5-byte NOP
instructions in a row, or so. This would slow things down immensely, but this
particular slowdown is the same for both OPTIMIZE_FOR_SIZE=y and
OPTIMIZE_FOR_SIZE=n.

Any better ideas?

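As a cross-check on the summary percentages, here is a small Python sketch that
recomputes each delta from the raw numbers, taking the optimize-for-speed
kernel as the baseline. The pct_change() helper and the dictionary names are
assumptions made for illustration. The four increases come out as listed; the
two reductions come out at roughly -14.2% (text) and -18.7% (branch misses)
on this baseline, while the -16.5% and -22.9% figures correspond to the
inverse view, i.e. how much larger those two values are in the
optimize-for-speed kernel:

```python
def pct_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return 100.0 * (new - base) / base


# Values copied from the size table and the perf stat output above.
size_kernel = {
    "text": 8_809_558, "instructions": 2_949_907_036,
    "branches": 595_955_367, "branch_misses": 31_504_981,
    "cycles": 6_757_858_145, "seconds": 0.164320722,
}
speed_kernel = {
    "text": 10_268_082, "instructions": 2_510_505_732,
    "branches": 493_721_089, "branch_misses": 38_731_708,
    "cycles": 6_061_867_073, "seconds": 0.148203574,
}

for key in size_kernel:
    # Effect of enabling -Os, with the speed-optimized kernel as baseline.
    delta = pct_change(size_kernel[key], speed_kernel[key])
    # Inverse view: how much bigger the speed-optimized value is than -Os.
    inverse = pct_change(speed_kernel[key], size_kernel[key])
    print(f"{key:14s} {delta:+7.2f}%   (inverse view: {inverse:+7.2f}%)")
```

Either way of stating the two reductions leaves the break-even argument as it
stands: the larger kernel has about 16.5% more text and runs this workload
about 11% faster.
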
	Ingo