Linus Torvalds wrote:
>
> Yeah, verified. Google for
>
>	northwood "barrel shifter"
>
> and you'll find a lot of it.
>
> Basically, older P4's will I think shift one bit at a time. So while even
> Prescott is relatively weak in the shifter department, pre-prescott
> (Willamette and Northwood) are _really_ weak. If your P4 is one of those,
> you really shouldn't use it to decide on optimizations.

Actually that's even more of a reason to make sure the code doesn't suck :)
The difference on less perverse cpus will usually be small, but on P4 it can
be huge.

A few years back I found my old ip checksum microbenchmark, and when I ran
it on a P4 (prescott iirc) i didn't believe my eyes. The straightforward
32-bit C implementation was running circles around the in-kernel one...
And a few tweaks to the assembler version got me another ~100% speedup.[1]
After that the P4 became the very first cpu to test any code on... :)

artur

[1] just reran the benchmark on this p4; true on northwood too:

IACCK 0.9.30  Artur Skawina <...>
[ exec time; lower is better  ] [speed ] [ time ] [ok?]
TIME-N+S TIME32 TIME33 TIME1480 MBYTES/S TIMEXXXX CSUM FUNCTION ( rdtsc_overhead=0  null=0 )
   17901    510    557     3010   393.36    59772 56dd csum_partial_cdumb16
    3019    154    156      431  2747.10    43106 56dd csum_partial_c32
    2413    170    177      328  3609.76    37501 56dd csum_partial_c32l
    2437    170    170      328  3609.76    37488 56dd csum_partial_c32i
    5078    205    254      767  1543.68    48117 56dd csum_partial_std
    5612    299    291      851  1391.30    53673 56dd csum_partial_686
    1584     99    127      227  5215.86    14495 56dd csum_partial_586f
    1738    107    121      229  5170.31    14785 56dd csum_partial_586fs
    4893    175    171      759  1559.95    52347 56dd csum_partial_copy_generic_std
    4949    151    189      756  1566.14    67847 56dd csum_partial_copy_generic_686
    2072    110    134      302  3920.53    39061 56dd csum_partial_copy_generic_p4as1
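
For anyone wondering what a "straightforward 32-bit C implementation" of the
IP checksum might look like, here is a minimal sketch. It is NOT the
csum_partial_c32 from the benchmark above (that source isn't in this mail);
the names fold16, csum16_sketch and csum32_sketch are made up for
illustration, and it doesn't try to match the kernel's csum_partial
interface. The idea is just to add 32-bit words into a 64-bit accumulator
and fold the carries back in at the end, with a 16-bit-at-a-time version
next to it for comparison. Both return the folded (not complemented) ones'
complement sum in native word order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide ones' complement accumulator down to 16 bits
 * by repeatedly adding the carries back into the low half. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical reference version: one 16-bit word per iteration. */
static uint16_t csum16_sketch(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    uint16_t w;

    for (; len >= 2; buf += 2, len -= 2) {
        memcpy(&w, buf, 2);     /* memcpy avoids unaligned loads */
        sum += w;
    }
    if (len) {                  /* odd trailing byte, zero-padded */
        w = 0;
        memcpy(&w, buf, 1);
        sum += w;
    }
    return fold16(sum);
}

/* Hypothetical 32-bit version: one 32-bit word per iteration.
 * Works because 2^16 == 1 (mod 2^16 - 1), so both halves of each
 * 32-bit word end up in the same 16-bit sum after folding. */
static uint16_t csum32_sketch(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0;
    uint32_t w32;
    uint16_t w16;

    for (; len >= 4; buf += 4, len -= 4) {
        memcpy(&w32, buf, 4);
        sum += w32;
    }
    if (len >= 2) {             /* leftover 16-bit word */
        memcpy(&w16, buf, 2);
        sum += w16;
        buf += 2;
        len -= 2;
    }
    if (len) {                  /* odd trailing byte, zero-padded */
        w16 = 0;
        memcpy(&w16, buf, 1);
        sum += w16;
    }
    return fold16(sum);
}
```

The two functions should agree for any buffer and length, which makes the
16-bit one handy as a correctness check when tuning the wider loop. Note
there are no shifts in the inner loop at all, only adds, which is presumably
part of why this style does comparatively well on a P4.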