Re: [PATCH v5 1/6] qspinlock: powerpc support qspinlock

xinhui <xinhui.pan@xxxxxxxxxxxxxxxxxx> · Tue, 21 Jun 2016 20:35:22 +0800

On 2016-06-07 05:41, Benjamin Herrenschmidt wrote:
On Mon, 2016-06-06 at 17:59 +0200, Peter Zijlstra wrote:
On Fri, Jun 03, 2016 at 02:33:47PM +1000, Benjamin Herrenschmidt wrote:

  - For the above, can you show (or describe) where the qspinlock
    improves things compared to our current locks.
So currently PPC has a fairly straightforward test-and-set spinlock
IIRC. You have this because LPAR/virt muck and lock holder preemption
issues etc..
qspinlock is 1) a fair lock (like ticket locks) and 2) provides
out-of-word spinning, reducing cacheline pressure.
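
For readers following along, here is a minimal userspace sketch of the contrast described above, written with C11 atomics. It is not the kernel's qspinlock; the queue lock below is a plain MCS lock, used only to illustrate FIFO ordering and spinning on a per-waiter node instead of on the lock word. All names (tas_lock, mcs_lock, mcs_node and their helpers) are made up for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Test-and-set lock: every waiter spins on the one shared lock word. */
struct tas_lock {
	atomic_bool locked;
};

static void tas_lock_init(struct tas_lock *l)
{
	atomic_init(&l->locked, false);
}

static void tas_lock_acquire(struct tas_lock *l)
{
	/* Each failed exchange still writes to the shared word. */
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
		;
}

static void tas_lock_release(struct tas_lock *l)
{
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* MCS queue lock: FIFO order, each waiter spins on its own node. */
struct mcs_node {
	_Atomic(struct mcs_node *) next;
	atomic_bool wait;
};

struct mcs_lock {
	_Atomic(struct mcs_node *) tail;
};

static void mcs_lock_init(struct mcs_lock *l)
{
	atomic_init(&l->tail, (struct mcs_node *)NULL);
}

static void mcs_lock_acquire(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *prev;

	assert(me != NULL);
	atomic_store_explicit(&me->next, (struct mcs_node *)NULL, memory_order_relaxed);
	atomic_store_explicit(&me->wait, true, memory_order_relaxed);

	/* Join the queue; this swap is the only access to the lock word. */
	prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
	if (prev == NULL)
		return;		/* queue was empty: lock acquired */

	atomic_store_explicit(&prev->next, me, memory_order_release);
	while (atomic_load_explicit(&me->wait, memory_order_acquire))
		;		/* spin on our own node, not on the lock word */
}

static void mcs_lock_release(struct mcs_lock *l, struct mcs_node *me)
{
	struct mcs_node *next = atomic_load_explicit(&me->next, memory_order_acquire);

	if (next == NULL) {
		struct mcs_node *expected = me;

		/* No visible successor: try to mark the queue empty. */
		if (atomic_compare_exchange_strong_explicit(&l->tail, &expected,
				(struct mcs_node *)NULL,
				memory_order_release, memory_order_relaxed))
			return;
		/* A waiter swapped tail but has not linked itself yet. */
		while ((next = atomic_load_explicit(&me->next,
						    memory_order_acquire)) == NULL)
			;
	}
	/* Hand the lock to the next waiter in FIFO order. */
	atomic_store_explicit(&next->wait, false, memory_order_release);
}
```

In the test-and-set variant all waiters keep writing to the same word; in the MCS variant a waiter touches the shared tail once and then spins on its own node, which is the "out-of-word spinning" property mentioned above.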

Thanks Peter. I think I understand the theory, but I'd like to see it
translate into real numbers.

Esp. on multi-socket x86 we saw the out-of-word spinning being a big win
over our ticket locks.

And fairness, brought to us by the ticket locks a long time ago,
eliminated starvation issues we had, where a spinner local to the holder
would 'always' win from a spinner further away. So under heavy enough
local contention, the spinners on 'remote' CPUs would 'never' get to own
the lock.

I think our HW has tweaks to avoid that from happening with the simple
locks in the underlying ll/sc implementation. In any case, what I'm
asking is actual tests to verify it works as expected for us.

If the HW has such tweaks, then there must be a performance drop as the total number of cpus grows.
And I do see signs of that.

One simple benchmark:
it counts how many spin_lock/spin_unlock pairs can be done within 15 seconds across all cpus.
Roughly:
/* 'lock' is an illustrative name for the single contended spinlock */
while (!done) {
	spin_lock(&lock);
	this_cpu_inc(loops);
	spin_unlock(&lock);
}

I ran the test on two machines, one using PowerKVM and the other using pHyp.
The table below shows the sum of loops at the end, written in xxxxK form.

cpu count    | pv-qspinlock | test-set spinlock
-------------+--------------+------------------
8 (powerKVM) |       62830K |            67340K
8 (pHyp)     |       49800K |            59330K
32 (pHyp)    |       87580K |            20990K

As the cpu count grows, the number of lock/unlock pairs completed with the test-set spinlock drops sharply (59330K at 8 cpus vs. 20990K at 32 cpus on pHyp).
This is because of cacheline bouncing between different physical cpus.
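
To put the pHyp rows in relative terms, the small sketch below just divides the 32-cpu total by the 8-cpu total for each lock. The helper names (scaling, print_scaling) are made up; the inputs are the table values in K loops.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput at the larger cpu count relative to the smaller one. */
static double scaling(double loops_small, double loops_large)
{
	assert(loops_small > 0);
	return loops_large / loops_small;
}

static void print_scaling(void)
{
	/* pHyp rows of the first table, in K loops per 15 s run */
	printf("pv-qspinlock  8 -> 32 cpus: %.2fx\n", scaling(49800, 87580));
	printf("test-set lock 8 -> 32 cpus: %.2fx\n", scaling(59330, 20990));
}
```

This prints 1.76x for pv-qspinlock and 0.35x for the test-set spinlock, i.e. with four times the cpus the total work goes up with pv-qspinlock and falls to roughly a third with the test-set lock.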

So to check how the two spinlocks affect the data cache,
here is another simple benchmark.
The code looks like:

struct _x {
	spinlock_t lk;
	unsigned long x;
} x;

while (!this_cpu_read(stop)) {
	int i = 0xff;

	spin_lock(&x.lk);
	this_cpu_inc(loops);
	/* read the data that sits next to the lock while holding it */
	while (i--)
		READ_ONCE(x.x);
	spin_unlock(&x.lk);
}
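
The point of putting lk and x in one struct is presumably that they end up in the same cacheline, so whatever the waiters do to the lock word also disturbs the holder's reads of x.x. A small compile-time check of that layout could look like the sketch below. It rests on two assumptions that are not stated in the mail: a 128-byte cacheline (as on POWER8) and a 4-byte lock word, modelled here with a plain int. The names x_layout, same_cacheline and lock_shares_line_with_data are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE_BYTES 128	/* assumption: POWER8-style line size */

/* Userspace stand-in for struct _x; 'int' replaces spinlock_t
 * (assumed to be 4 bytes without lock debugging). */
struct x_layout {
	int lk;
	unsigned long x;
};

/* The whole struct fits in one line, so lk and x can share it. */
static_assert(sizeof(struct x_layout) <= CACHELINE_BYTES,
	      "lock and data do not fit in one cacheline");

/* True if two member offsets land in the same line, assuming the
 * object itself starts on a cacheline boundary. */
static bool same_cacheline(size_t off_a, size_t off_b)
{
	return off_a / CACHELINE_BYTES == off_b / CACHELINE_BYTES;
}

static bool lock_shares_line_with_data(void)
{
	return same_cacheline(offsetof(struct x_layout, lk),
			      offsetof(struct x_layout, x));
}
```

With a test-and-set lock every waiter keeps writing to that shared line while the holder is reading x.x; with a queued lock the waiters spin elsewhere, which would fit the results below.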

The table below shows the sum of loops at the end, written in xxxxK form.

cpu count    | pv-qspinlock | test-set spinlock
-------------+--------------+------------------
8 (pHyp)     |       13240K |             9780K
32 (pHyp)    |       25790K |             9700K

In this test pv-qspinlock is clearly more cache-friendly and performs better than the test-set spinlock (25790K vs. 9700K at 32 cpus).

More tests are in progress; I will send out a new patch set with the results,
hopefully *within* this week. unixbench really takes a long time.

thanks
xinhui
pv-qspinlock tries to preserve the fairness while allowing limited lock
stealing and explicitly managing which vcpus to wake.

Right.

    While there's theory and to some extent practice on x86, it would be
    nice to validate the effects on POWER.
Right; so that will have to be from benchmarks which I cannot help you
with ;-)

Precisely :-) This is what I was asking for ;-)

Cheers,
Ben.
