On Tue, 11 Aug 2009, Linus Torvalds wrote:

> On Tue, 11 Aug 2009, Nicolas Pitre wrote:
> >
> > BLK_SHA1: 5.280s  [original]
> > BLK_SHA1: 7.410s  [with SMALL_REGISTER_SET defined]
> > BLK_SHA1: 7.480s  [with 'W(x)=(val);asm("":"+m" (W(x)))']
> > BLK_SHA1: 4.980s  [with 'W(x)=(val);asm("":::"memory")']
> >
> > At this point the generated assembly is pretty slick.  I bet the full
> > memory barrier might help on x86 as well.
>
> No, I had tested that earlier - single-word memory barrier for some
> reason gets _much_ better numbers at least on x86-64. We're talking
>
> 	linus	1.46	418.2
> vs
> 	linus	2.004	304.6
>
> kind of differences. With the "+m" it outperforms openssl (375-380MB/s).
>
> The "volatile unsigned int *" cast looks pretty much like the "+m"
> version to me, but Arthur got a speedup from whatever gcc code
> generation differences on his P4.

The volatile pointer forces a write to memory, but the cached value in
the processor's register remains valid, whereas the "+m" forces gcc to
assume the register copy is no longer valid.  That certainly gives the
compiler a different clue about register availability, etc.

> The really fundamental and basic problem with gcc on this code is that
> gcc does not see _any_ difference what-so-ever between the five
> variables declared with
>
> 	unsigned int A, B, C, D, E;
>
> and the sixteen variables declared with
>
> 	unsigned int array[16];
>
> and considers those all to be 21 local variables. It really seems to
> think that they are all 100% equivalent, and gcc totally ignores me
> doing things like adding "register" to the A-E ones etc.
>
> And if you are a compiler, and think that the routine has 21 equal
> register variables, you're going to do crazy reload sh*t when you have
> only 7 (or 15) GP registers. So doing that full memory barrier seems to
> just take that random situation, and force some random variable to be
> spilled (this is all from looking at the generated code, not from
> looking at gcc).
>
> In contrast, with the _targeted_ thing ("you'd better write back into
> array[]") we force gcc to spill the array[16] values, and not the A-E
> ones, and that's why it seems to make such a big difference.
>
> And no, I'm not sure why ARM apparently doesn't show the same behavior.
> Or maybe it does, but with an in-order core it doesn't matter as much
> which registers you keep reloading - you'll be serialized all the time
> _anyway_.

Well... gcc is really strange in this case (and similar ones) when
compiling for ARM.  A good indicator of the quality of the generated
code is the size of the stack frame.

When using the "+m" constraint, gcc creates an 816-byte stack frame, the
generated binary grows by approximately 3000 bytes, and performance is
almost halved (7.600s).  Looking at the assembly output I just can't
figure out all the crazy moves taking place.  Even the version with no
barrier whatsoever produces better assembly, with a 560-byte stack
frame.  The volatile version is second worst, with an 808-byte stack
frame and similarly bad performance.

With the full "memory" clobber, the stack frame shrinks to 280 bytes and
the best performance so far is obtained.  None of the A, B, C, D, E or
data variables are ever spilled onto the stack either; only array[16]
gets allocated stack slots, plus the TEMP variable.


Nicolas
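
For concreteness, the store variants being compared above look roughly
like the sketch below.  The setW_*() macro names, the static array, and
the masked indexing are illustrative assumptions rather than the literal
block-sha1.c source; only the barrier idioms themselves (the volatile
cast, the single-word "+m" constraint, and the full "memory" clobber)
come from the message:

/*
 * Rough sketch of the store variants discussed above -- assumed names
 * and scaffolding, not the actual block-sha1.c source.
 */
static unsigned int array[16];

/* The 16-word working array, indexed modulo 16 (assumed helper). */
#define W(x) (array[(x) & 15])

/* Plain store: gcc may keep the value cached in a register and delay
 * the actual write to memory. */
#define setW_plain(x, val)	(W(x) = (val))

/* Volatile cast: forces the store to memory, but gcc may still treat
 * its register copy of W(x) as valid afterwards. */
#define setW_volatile(x, val)	(*(volatile unsigned int *)&W(x) = (val))

/* Targeted "+m": gcc must assume this one word was modified in memory,
 * so array[] gets spilled while A..E can stay in registers. */
#define setW_plus_m(x, val) \
	do { W(x) = (val); __asm__("" : "+m" (W(x))); } while (0)

/* Full "memory" clobber: gcc must assume all memory changed and
 * discard every register-cached memory value. */
#define setW_memory(x, val) \
	do { W(x) = (val); __asm__("" : : : "memory"); } while (0)

The point of the comparison is that the targeted "+m" and the blanket
"memory" clobber give gcc opposite hints about which of the 21 locals
are worth keeping in registers, which is what the stack-frame sizes
above end up measuring.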