Re: [PATCH v3 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation

On 01/31/2014 10:08 AM, Peter Zijlstra wrote:
> On Tue, Jan 28, 2014 at 01:19:10PM -0500, Waiman Long wrote:
> > For single-thread performance (no contention), a 256K lock/unlock
> > loop was run on a 2.4GHz Westmere x86-64 CPU.  The following table
> > shows the average time (in ns) for a single lock/unlock sequence
> > (including the looping and timing overhead):
> >
> >    Lock Type			Time (ns)
> >    ---------			---------
> >    Ticket spinlock		  14.1
> >    Queue spinlock (Normal)	   8.8*
> What CONFIG_NR_CPUS ?

I was testing on a RHEL6.4 system which has a CONFIG_NR_CPUS of 4096.


> Because for CONFIG_NR_CPUS < 128 (or 256 if you got !PARAVIRT), the fast
> path code should be:
>
> ticket:
>
>    mov $0x100,%eax
>    lock xadd %ax,(%rbx)
>    cmp %al,%ah
>    jne ...
>
> although my GCC is being silly and writes:
>
>    mov $0x100,%eax
>    lock xadd %ax,(%rbx)
>    movzbl %ah,%edx
>    cmp %al,%dl
>    jne ...
>
> Which seems rather like a waste of a perfectly good cycle.
>
> With a bigger NR_CPUS you do indeed need more ops:
>
>    mov $0x10000,%edx
>    lock xadd %edx,(%rbx)
>    mov %edx,%ecx
>    shr $0x10,%ecx
>    cmp %dx,%cx
>    jne ...
>
> Whereas for the straight cmpxchg() you'd get something relatively simple
> like:
>
>    mov %edx,%eax
>    lock cmpxchg %ecx,(%rbx)
>    cmp %edx,%eax
>    jne ...
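A rough C counterpart of that cmpxchg fast path, using C11 atomics. This is only a sketch, not the posted patch; the `qsl_t` type, the `qsl_trylock` name, and the assumption that the locked value is 1 are all illustrative:

```c
#include <stdatomic.h>

#define Q_LOCKED_VAL 1U	/* assumed locked value, for illustration */

typedef struct {
	_Atomic unsigned int val;
} qsl_t;

static int qsl_trylock(qsl_t *lock)
{
	unsigned int expected = 0;

	/* On x86-64 this compiles down to roughly the quoted sequence:
	 * mov / lock cmpxchg / cmp / jne. */
	return atomic_compare_exchange_strong(&lock->val, &expected,
					      Q_LOCKED_VAL);
}
```

The single compare-and-swap both tests for "unlocked" and claims the lock, which is why the generated code needs no extra shifting or masking regardless of NR_CPUS.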

I believe the speeds of the lock functions are about the same. However, the qspinlock has a much simpler unlock function, which probably accounts for most of the speed gain.
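The difference on the unlock side can be sketched as follows (illustrative C11-atomics code, not the kernel's implementation): when the lock holder owns the whole lock word, queue-spinlock unlock is a plain release store of 0, while ticket unlock must bump the head half of the word to hand the lock on:

```c
#include <stdatomic.h>

typedef struct {
	_Atomic unsigned int val;
} qlock_t;

static void qlock_unlock(qlock_t *lock)
{
	/* single store, no locked read-modify-write needed */
	atomic_store_explicit(&lock->val, 0, memory_order_release);
}

typedef struct {
	_Atomic unsigned short head;	/* ticket now being served */
	unsigned short tail;		/* next ticket to hand out */
} tlock_t;

static void tlock_unlock(tlock_t *lock)
{
	/* hand the lock to the next waiter by bumping the head */
	atomic_fetch_add_explicit(&lock->head, 1, memory_order_release);
}
```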

> Anyway, as soon as you get some (light) contention you're going to tank
> because you have to pull in extra cachelines, which is sad.

Light contention is the only case where the qspinlock may not perform as well as the ticket spinlock, and I know this is the most common case. However, I would argue that the slowdown, if any, will not be really noticeable. This is what I will try to find out.


> I suppose we could from the ticket code more and optimize the
> uncontended path, but that'll make the contended path more expensive
> again, although probably not as bad as hitting a new cacheline.

I don't get what you are trying to say.

Right now, I am using only bit 0 as the lock bit. I could use bit 4, for instance, as a pending-locker bit and spin until bit 0 is clear. Then, if there is only one other task spinning, it won't need to fetch another cacheline. However, this will slow down the uncontended path, as I can no longer assign a 0 byte to free the lock; I have to use an atomic subtraction or a clear-bit instruction instead.
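The pending-bit idea above might look something like this sketch, assuming at most one contender (a real implementation must handle more). Bit 0 is the lock bit; `PENDING` stands in for the proposed bit 4, and all names here are made up for the example:

```c
#include <stdatomic.h>

#define LOCKED	(1U << 0)
#define PENDING	(1U << 4)

typedef struct {
	_Atomic unsigned int val;
} plock_t;

static void plock_lock(plock_t *lock)
{
	unsigned int old = 0;

	/* uncontended fast path: 0 -> LOCKED in one cmpxchg */
	if (atomic_compare_exchange_strong(&lock->val, &old, LOCKED))
		return;

	/* single contender: mark pending and spin on bit 0 without
	 * pulling in any extra cacheline */
	atomic_fetch_or(&lock->val, PENDING);
	while (atomic_load(&lock->val) & LOCKED)
		;	/* cpu_relax() in real code */

	/* the sole pending waiter takes over: PENDING -> LOCKED */
	atomic_store(&lock->val, LOCKED);
}

static void plock_unlock(plock_t *lock)
{
	/* cannot just store 0: PENDING may be set, so clear only the
	 * lock bit -- the uncontended-path cost described above */
	atomic_fetch_and(&lock->val, ~LOCKED);
}
```

The unlock function shows the trade-off: a plain zero store would wipe out a concurrently set pending bit, so even the uncontended release needs a locked read-modify-write.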

-Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



