> You can save yourself this MOV here in what is, I'm assuming, the
> general likely case where @src is aligned and do:
>
>         /* check for bad alignment of source */
>         testl $7, %esi
>         /* already aligned? */
>         jz 102f
>
>         movl %esi,%ecx
>         subl $8,%ecx
>         negl %ecx
>         subl %ecx,%edx
> 0:      movb (%rsi),%al
>         movb %al,(%rdi)
>         incq %rsi
>         incq %rdi
>         decl %ecx
>         jnz 0b

The "testl $7, %esi" just checks the low three bits ... it doesn't
change %esi. But the code from the "subl $8" on down assumes that %ecx
is a number in [1..7] as the count of bytes to copy until we achieve
alignment. So your "movl %esi,%ecx" needs to be something that just
copies the low three bits and zeroes the high part of %ecx. Is there a
cute way to do that in x86 assembler? (One candidate is sketched at
the end of this mail.)

> Why aren't we pushing %r12-%r15 on the stack after the "jz 17f" above
> and using them too and thus copying a whole cacheline in one go?
>
> We would need to restore them when we're done with the cacheline-wise
> shuffle, of course.

I copied that loop from arch/x86/lib/copy_user_64.S:__copy_user_nocache()

I guess the answer depends on whether you generally copy enough cache
lines to save enough time to cover the cost of saving and restoring
those registers. (A rough sketch of that loop is also below.)

-Tony
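
For reference, one candidate for the "cute way" above -- just a
sketch, untested, reusing the label numbers from the quoted snippet
as placeholders. Masking with "andl" both isolates the low three bits
and sets ZF, so the separate "testl" can go away:

        movl %esi,%ecx
        /* keep only the low three bits; sets ZF, so no "testl" needed */
        andl $7,%ecx
        /* already aligned? */
        jz 102f

        subl $8,%ecx
        /* %ecx = 8 - (src & 7): bytes to copy until @src is aligned */
        negl %ecx
        subl %ecx,%edx
0:      movb (%rsi),%al
        movb %al,(%rdi)
        incq %rsi
        incq %rdi
        decl %ecx
        jnz 0b

Since "andl" writes the masked value into %ecx and sets the flags in
one instruction, this prologue ends up one instruction shorter than
the quoted version while fixing the [1..7] assumption.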
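
And a rough sketch of the cacheline-wide variant suggested above --
again untested and purely illustrative. It assumes 64-byte cachelines,
that the remaining length is a whole number of cachelines with that
cacheline count in %ecx, and it mirrors the non-temporal "movnti"
stores of __copy_user_nocache(), so an "sfence" would still be needed
before returning:

        /* %r12-%r15 are callee-saved, so preserve them */
        pushq %r12
        pushq %r13
        pushq %r14
        pushq %r15
1:
        /* read one full 64-byte cacheline into eight registers */
        movq 0*8(%rsi),%r8
        movq 1*8(%rsi),%r9
        movq 2*8(%rsi),%r10
        movq 3*8(%rsi),%r11
        movq 4*8(%rsi),%r12
        movq 5*8(%rsi),%r13
        movq 6*8(%rsi),%r14
        movq 7*8(%rsi),%r15
        /* write it back with non-temporal stores */
        movnti %r8,0*8(%rdi)
        movnti %r9,1*8(%rdi)
        movnti %r10,2*8(%rdi)
        movnti %r11,3*8(%rdi)
        movnti %r12,4*8(%rdi)
        movnti %r13,5*8(%rdi)
        movnti %r14,6*8(%rdi)
        movnti %r15,7*8(%rdi)
        leaq 64(%rsi),%rsi
        leaq 64(%rdi),%rdi
        decl %ecx
        jnz 1b
        popq %r15
        popq %r14
        popq %r13
        popq %r12

The four pushes and pops cost eight extra stack accesses per call, so
this only pays off when enough cachelines are moved per call to
amortize them -- which is exactly the open question above.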