On Mon, 10 Aug 2009, Linus Torvalds wrote:

> For x86 performance (especially in 32-bit mode) I added that hack to
> write the SHA1 internal temporary hash using a volatile pointer, in
> order to get gcc to not try to cache the array contents. Because gcc
> will do all the wrong things, and then spill things in insane random
> ways.
>
> But on architectures like PPC, where you have 32 registers, it's
> actually perfectly reasonable to put the whole temporary array[] into
> the register set, and gcc can do so.
>
> So make the 'volatile unsigned int *' cast be dependent on a
> SMALL_REGISTER_SET preprocessor symbol, and enable it (currently) on
> just x86 and x86-64. With that, the routine is fairly reasonable even
> when compared to the hand-scheduled PPC version. Ben Herrenschmidt
> reports on a G5:
>
>  * Paulus asm version: about 3.67s
>  * Yours with no change: about 5.74s
>  * Yours without "volatile": about 3.78s
>
> so with this the C version is within about 3% of the asm one.
>
> And add a lot of commentary on what the heck is going on.
>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> ---
>
> I also asked David Miller to test the non-volatile version on Sparc,
> but I suspect it will have the same pattern. ia64 likewise (but I
> have not asked anybody to test).
>
> Of the other architectures, ARM probably would want
> SMALL_REGISTER_SET, but I suspect the problem there is the htonl()
> (on little-endian), and possibly the unaligned loads - at least on
> older ARM. The latter is something gcc could be taught about, though
> (the SHA_SRC macro would just need to use a pointer that goes through
> a packed struct member or something).

The "older" ARMs (those that don't perform unaligned accesses in
hardware) are still by far the majority in the field.

Here are some numbers on ARM for 203247018 bytes:

  MOZILLA_SHA1: 14.520s
  ARM_SHA1:      5.600s
  OPENSSL:       5.530s
  BLK_SHA1:      5.280s  [original]
  BLK_SHA1:      7.410s  [with SMALL_REGISTER_SET defined]
  BLK_SHA1:      7.480s  [with 'W(x)=(val);asm("":"+m" (W(x)))']
  BLK_SHA1:      4.980s  [with 'W(x)=(val);asm("":::"memory")']

At this point the generated assembly is pretty slick.  I bet the full
memory barrier might help on x86 as well.  (A sketch of these store
variants is appended at the end of this message.)

However, the above BLK_SHA1 works only for aligned source buffers.  So
let's define our own SHA_SRC to replace the htonl() (which should
probably be ntohl() by the way) like this:

#define SHA_SRC(t) \
	({ unsigned char *__d = (unsigned char *)&data[t]; \
	   (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

This provides the exact same performance as the ntohl() based version
(4.980s), except that it now copes with unaligned buffers too (a small
standalone check is also appended below).

Of course the BLK_SHA1 version is a pig since it is totally unrolled:

   text    data     bss     dec     hex filename
   1220       0       0    1220     4c4 mozilla-sha1/sha1.o
    852       0       0     852     354 arm/sha1_arm.o
   6292       0       0    6292    1894 block-sha1/sha1.o

so the speed advantage comes at a significant (but relative) code size
cost.

Nicolas
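
For completeness, here is a minimal sketch of how the store variants
compared in the table above could be spelled out side by side.  Only
W(x), SMALL_REGISTER_SET and the asm statements come from the
discussion; the setW name and the USE_MEMORY_CLOBBER switch are made
up for illustration:

/*
 * Illustration only: W(x) is assumed to index the 16-word temporary
 * array from the patch under discussion; setW and USE_MEMORY_CLOBBER
 * are hypothetical names.
 */
#if defined(SMALL_REGISTER_SET)
/* Linus' volatile store: forces each result out to memory so that
 * gcc doesn't try to keep the whole array in too few registers. */
#define setW(x, val)	(*(volatile unsigned int *)&W(x) = (val))
#elif defined(USE_MEMORY_CLOBBER)
/* The 4.980s variant measured above: a plain store followed by a
 * compiler-level memory barrier, which likewise keeps gcc from
 * caching the array in registers across rounds. */
#define setW(x, val)	do { W(x) = (val); asm("" ::: "memory"); } while (0)
#else
/* Plain store: fine where registers are plentiful (e.g. PPC). */
#define setW(x, val)	(W(x) = (val))
#endif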
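
And here is a small standalone check of the SHA_SRC macro on a
deliberately misaligned buffer.  This is just a demonstration harness,
assuming gcc (the ({ ... }) statement expression is a gcc extension):

#include <stdio.h>

/* The byte-wise big-endian load from above: __d is dereferenced one
 * byte at a time, so no word-sized aligned access is ever issued. */
#define SHA_SRC(t) \
	({ unsigned char *__d = (unsigned char *)&data[t]; \
	   (__d[0] << 24) | (__d[1] << 16) | (__d[2] << 8) | (__d[3] << 0); })

int main(void)
{
	unsigned char buf[9] = { 0x00, 0x12, 0x34, 0x56, 0x78 };
	/* misaligned source pointer; never dereferenced as an int */
	unsigned int *data = (unsigned int *)(buf + 1);

	printf("%08x\n", SHA_SRC(0));	/* prints 12345678 */
	return 0;
}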