Re: [PATCH v3 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

On 01/31/2014 10:08 AM, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 01:19:10PM -0500, Waiman Long wrote:
> > For single-thread performance (no contention), a 256K lock/unlock
> > loop was run on a 2.4GHz Westmere x86-64 CPU.  The following table
> > shows the average time (in ns) for a single lock/unlock sequence
> > (including the looping and timing overhead):
> >
> >    Lock Type			Time (ns)
> >    ---------			---------
> >    Ticket spinlock		  14.1
> >    Queue spinlock (Normal)	   8.8*
> What CONFIG_NR_CPUS ?

I was testing on a RHEL6.4 system which has a CONFIG_NR_CPUS of 4096.


> Because for CONFIG_NR_CPUS < 128 (or 256 if you got !PARAVIRT), the fast
> path code should be:
>
> ticket:
>
>    mov $0x100,%eax
>    lock xadd %ax,(%rbx)
>    cmp %al,%ah
>    jne ...
>
> although my GCC is being silly and writes:
>
>    mov $0x100,%eax
>    lock xadd %ax,(%rbx)
>    movzbl %ah,%edx
>    cmp %al,%dl
>    jne ...
>
> Which seems rather like a waste of a perfectly good cycle.
>
> With a bigger NR_CPUS you do indeed need more ops:
>
>    mov $0x10000,%edx
>    lock xadd %edx,(%rbx)
>    mov %edx,%ecx
>    shr $0x10,%ecx
>    cmp %dx,%cx
>    jne ...
>
> Whereas for the straight cmpxchg() you'd get something relatively simple
> like:
>
>    mov %edx,%eax
>    lock cmpxchg %ecx,(%rbx)
>    cmp %edx,%eax
>    jne ...
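A rough C counterpart of that cmpxchg fast path, using C11 atomics. This is only a sketch, not the posted patch; the `qsl_t` type, the `qsl_trylock` name, and the assumption that the locked value is 1 are all illustrative:

```c
#include <stdatomic.h>

#define Q_LOCKED_VAL 1U	/* assumed locked value, for illustration */

typedef struct {
	_Atomic unsigned int val;
} qsl_t;

static int qsl_trylock(qsl_t *lock)
{
	unsigned int expected = 0;

	/* On x86-64 this compiles down to roughly the quoted sequence:
	 * mov / lock cmpxchg / cmp / jne. */
	return atomic_compare_exchange_strong(&lock->val, &expected,
					      Q_LOCKED_VAL);
}
```

The single compare-and-swap both tests for "unlocked" and claims the lock, which is why the generated code needs no extra shifting or masking regardless of NR_CPUS.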

I believe the speeds of the lock functions are about the same. However, the qspinlock has a much simpler unlock function, which probably accounts for most of the speed gain.
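The difference on the unlock side can be sketched as follows (illustrative C11-atomics code, not the kernel's implementation): when the lock holder owns the whole lock word, queue-spinlock unlock is a plain release store of 0, while ticket unlock must bump the head half of the word to hand the lock on:

```c
#include <stdatomic.h>

typedef struct {
	_Atomic unsigned int val;
} qlock_t;

static void qlock_unlock(qlock_t *lock)
{
	/* single store, no locked read-modify-write needed */
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}

typedef struct {
	_Atomic unsigned short head;	/* ticket now being served */
	unsigned short tail;		/* next ticket to hand out */
} tlock_t;

static void tlock_unlock(tlock_t *lock)
{
	/* hand the lock to the next waiter by bumping the head */
	atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}
```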

> Anyway, as soon as you get some (light) contention you're going to tank
> because you have to pull in extra cachelines, which is sad.

Light contention is the only case where the qspinlock may not perform as well as the ticket spinlock, and I know this is the most common case. However, I would argue that the slowdown, if any, will not be really noticeable. This is what I will try to find out.


> I suppose we could from the ticket code more and optimize the
> uncontended path, but that'll make the contended path more expensive
> again, although probably not as bad as hitting a new cacheline.

I don't get what you are trying to say.

Right now, I am using only bit 0 as the lock bit. I could use bit 4, for instance, as a pending-locker bit and spin until bit 0 is clear. Then, if there is only one other task spinning, it won't need to fetch another cacheline. However, this will slow down the uncontended path, as I can no longer assign a 0 byte to free the lock; I have to use an atomic subtraction or a clear-bit instruction instead.
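The pending-bit idea above might look something like this sketch, assuming at most one contender (a real implementation must handle more). Bit 0 is the lock bit; `PENDING` stands in for the proposed bit 4, and all names here are made up for the example:

```c
#include <stdatomic.h>

#define LOCKED	(1U << 0)
#define PENDING	(1U << 4)

typedef struct {
	_Atomic unsigned int val;
} plock_t;

static void plock_lock(plock_t *lock)
{
	unsigned int old = 0;

	/* uncontended fast path: 0 -> LOCKED in one cmpxchg */
	if (atomic_compare_exchange_strong(&lock->val, &old, LOCKED))
		return;

	/* single contender: mark pending and spin on bit 0 without
	 * pulling in any extra cacheline */
	atomic_fetch_or(&lock->val, PENDING);
	while (atomic_load(&lock->val) & LOCKED)
		;	/* cpu_relax() in real code */

	/* the sole pending waiter takes over: PENDING -> LOCKED */
	atomic_store(&lock->val, LOCKED);
}

static void plock_unlock(plock_t *lock)
{
	/* cannot just store 0: PENDING may be set, so clear only the
	 * lock bit -- the uncontended-path cost described above */
	atomic_fetch_and(&lock->val, ~LOCKED);
}
```

The unlock function shows the trade-off: a plain zero store would wipe out a concurrently set pending bit, so even the uncontended release needs a locked read-modify-write.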

-Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



