Hi, I found _spin_lock used a LOCK instruction to make the following operation "decb %0" atomic. As you know, LOCK instruction alone takes almost 70 clock cycles to finish and this add lots of cost to the _spin_lock. However _spin_unlock does not use this LOCK instruction and it uses "movb $1,%0" instead since 4-byte writes on 4-byte aligned addresses are atomic. So I want rewrite the _spin_lock defined spinlock.h (/linux/include/asm-i386) as follows to reduce the overhead of _spin_lock and make it more efficient. #define spin_lock_string \ "\n1:\t" \ "cmpb $0,%0\n\t" \ "jle 2f\n\t" \ "movb $0, %0\n\t" \ "jmp 3f\n" \ "2:\t" \ "rep;nop\n\t" \ "cmpb $0, %0\n\t" \ "jle 2b\n\t" \ "jmp 1b\n" \ "3:\n\t" Compared with the original version as follows, LOCK instruction is removed. I rebuilt the Intel e1000 Gigabit driver with this _spin_lock. There is about 2% throughput improvement. #define spin_lock_string \ "\n1:\t" \ "lock ; decb %0\n\t" \ "jns 3f\n" \ "2:\t" \ "rep;nop\n\t" \ "cmpb $0,%0\n\t" \ "jle 2b\n\t" \ "jmp 1b\n" \ "3:\n\t" Do you think I can get a better performance if I dig further? Any ideas will be greatly appreciated, L.Y. - : send the line "unsubscribe linux-net" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html