On Thu, Apr 30, 2009, Steven Noonan <steven@xxxxxxxxxxxxxx> wrote:
> A bit off topic, but the results are rather interesting to me, and I
> think I see a weakness in how GCC is doing this on Intel. Someone
> please correct me if I'm wrong, but the PowerPC code seems much better
> because it can yield very high instruction-level parallelism. It does
> 5 loads and then 5 stores, using 4 registers for temporary storage and
> 2 registers for pointers.
>
> I realize the Intel x86 architecture is quite constrained in that it
> has so few general purpose registers, but there has to be better code
> than what GCC emitted above. It seems like the processor would stall
> because of the quantity of sequential inter-dependent instructions
> that can't be done in parallel (mov to memory that depends on a mov to
> eax, etc).

There aren't any unnecessary dependencies. Take this sequence:

    1: movl (%edx), %eax
    2: movl %eax, (%ecx)
    3: movl 4(%edx), %eax
    4: movl %eax, 4(%ecx)

There are two unavoidable dependencies - #2 depends on #1, and #4
depends on #3. #3 does not depend on #2, even though they both use
%eax, because #3 is a write to %eax. So whatever was in %eax before #3
is irrelevant. The processor knows this and will use register renaming
to execute #1 and #3 in parallel, and #2 and #4 in parallel.

James
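
As a rough illustration of the dependency argument in the reply, here is a
small Python sketch that classifies the register dependencies in the
four-instruction sequence. The tuple encoding of each instruction and the
helper name `classify` are hypothetical, made up for this example. The sketch
models registers only: it ignores the memory operands, so any possible
aliasing between the (%edx) loads and the (%ecx) stores, which real hardware
also has to check, is assumed away.

```python
# Each entry: (label, registers read, registers written).
# Hypothetical encoding for illustration; memory operands are not modeled.
INSTRUCTIONS = [
    (1, {"edx"}, {"eax"}),         # movl (%edx), %eax
    (2, {"eax", "ecx"}, set()),    # movl %eax, (%ecx)
    (3, {"edx"}, {"eax"}),         # movl 4(%edx), %eax
    (4, {"eax", "ecx"}, set()),    # movl %eax, 4(%ecx)
]


def classify(instructions):
    """Split register dependencies into true (RAW) and false (WAR/WAW)."""
    true_deps = []            # read-after-write: must be honored
    false_deps = []           # name dependencies: removable by renaming
    last_writer = {}          # register -> label of its most recent writer
    readers_since_write = {}  # register -> labels that read it since then

    for label, reads, writes in instructions:
        for reg in sorted(reads):
            # A read depends only on the most recent write to that register.
            if reg in last_writer:
                true_deps.append((last_writer[reg], label, "RAW", reg))
            readers_since_write.setdefault(reg, []).append(label)
        for reg in sorted(writes):
            # Overwriting a register that earlier instructions still read.
            for reader in readers_since_write.get(reg, []):
                if reader != label:
                    false_deps.append((reader, label, "WAR", reg))
            # Overwriting a register that an earlier instruction wrote.
            if reg in last_writer:
                false_deps.append((last_writer[reg], label, "WAW", reg))
            last_writer[reg] = label
            readers_since_write[reg] = []
    return true_deps, false_deps


TRUE_DEPS, FALSE_DEPS = classify(INSTRUCTIONS)
print("true dependencies: ", TRUE_DEPS)
print("false dependencies:", FALSE_DEPS)
```

Under these assumptions the only true dependencies are #1 -> #2 and #3 -> #4.
The entries involving #3 (write-after-read against #2, write-after-write
against #1) are name dependencies on %eax, which is the kind of dependency
register renaming removes.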