> On Thu, 6 Aug 2009, Artur Skawina wrote:
>> #             TIME[s] SPEED[MB/s]
>> rfc3174         1.357       44.99
>> rfc3174         1.352       45.13
>> mozilla         1.509       40.44
>> mozillaas       1.133       53.87
>> linus          0.5818       104.9

> #Initializing... Rounds: 1000000, size: 62500K, time: 1.421s, speed: 42.97MB/s
> #             TIME[s] SPEED[MB/s]
> rfc3174         1.403        43.5
> # New hash result: b747042d9f4f1fdabd2ac53076f8f830dea7fe0f
> rfc3174         1.403       43.51
> linus          0.5891       103.6
> linusas        0.5337       114.4
> mozilla         1.535       39.76
> mozillaas       1.128       54.13

I'm trying to absorb what you're learning about P4 performance, but I'm
getting confused... what is what in these benchmarks?

The major architectural decisions I see are:

1) Three possible ways to compute the W[] array for rounds 16..79:
   1a) Compute W[16..79] in a loop beforehand (you noted that unrolling
       two copies helped significantly.)
   1b) Compute W[16..79] as part of hash rounds 16..79.
   1c) Compute W[0..15] in-place as part of hash rounds 16..79.

2) The main hashing can be rolled up or unrolled:
   2a) Four 20-round loops.  (In case of options 1b and 1c, the first
       one might be split into a 16 and a 4.)
   2b) Four 4-round loops, each unrolled 5x.  (See the ARM assembly.)
   2c) All 80 rounds unrolled.

As Linus noted, 1c is not friends with options 2a and 2b, because the
W() indexing math is no longer a compile-time constant.

Linus has posted 1a+2c and 1c+2c.  You posted some code that could be
2a or 2c depending on an UNROLL preprocessor #define.  Which combinations
are your "linus" and "linusas" code?

You talk about "and my atom seems to like the compact loops too", but
I'm not sure which loops those are.

Thanks.
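To make the distinction between 1a and 1c concrete, here is a minimal sketch of the two W[] strategies. It is only an illustration of the standard SHA-1 message-schedule recurrence, not code taken from any of the implementations benchmarked above; the names rol32, expand_full, ring_word and schedules_agree are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rol32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* Option 1a: expand the whole schedule up front into an 80-word array. */
static void expand_full(const uint32_t in[16], uint32_t W[80])
{
    int t;

    memcpy(W, in, 16 * sizeof(uint32_t));
    for (t = 16; t < 80; t++)
        W[t] = rol32(W[t - 3] ^ W[t - 8] ^ W[t - 14] ^ W[t - 16], 1);
}

/*
 * Option 1c: keep only a 16-word ring and overwrite it in place.
 * Must be called with t = 0, 1, 2, ... in order.  The indices are
 * (t-3), (t-8), (t-14) and (t-16) reduced mod 16.
 */
static uint32_t ring_word(uint32_t W[16], int t)
{
    assert(t >= 0 && t < 80);
    if (t >= 16)
        W[t & 15] = rol32(W[(t + 13) & 15] ^ W[(t + 8) & 15] ^
                          W[(t + 2) & 15] ^ W[t & 15], 1);
    return W[t & 15];
}

/* Returns 1 if both strategies yield the same 80 schedule words. */
int schedules_agree(const uint32_t in[16])
{
    uint32_t full[80], ring[16];
    int t;

    expand_full(in, full);
    memcpy(ring, in, sizeof(ring));
    for (t = 0; t < 80; t++)
        if (ring_word(ring, t) != full[t])
            return 0;
    return 1;
}
```

With all 80 rounds unrolled (2c), t is a literal at every call site, so each `t & 15` folds to a fixed array index. Inside a rolled-up loop (2a or 2b), t is a run-time value and the masking has to be done on every access, which is the mismatch between 1c and the loop variants mentioned above.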