On Tue, Mar 09, 2021 at 02:35:42PM -0800, Andi Kleen wrote:
> "Darrick J. Wong" <djwong@xxxxxxxxxx> writes:
> > It might be nice to leave that as a breadcrumb, then, in case the
> > spinlock scalability problems ever get solved.
>
> It might be already solved, depending on if Dave's rule of thumb
> was determined before the Linux spinlocks switched to MCS locks or
> not.

It's what I see on my current 2-socket, 32p/64t machine with a
handful of optane DC4800 SSDs attached to it running 5.12-rc1+.

MCS doesn't make spin contention go away, it just stops spinlocks
from bouncing the same cacheline all over the machine. i.e. if
you've got more than a single CPU's worth of critical section to
execute, then spinlocks are going to spin, no matter how they are
implemented. So AFAICT the contention I'm measuring is not
cacheline bouncing, but just the cost of multiple CPUs spinning
while queued waiting for the spinlock...

> In my experience spinlock scalability depends a lot on how long the
> critical section is (that is very important, short sections are a
> lot worse than long sections), as well as if the contention is
> inside a socket or over sockets, and the actual hardware behaves
> differently too.

Yup, and most of the critical sections that the icloglock is used
to protect are quite short.

> So I would be quite surprised if the "rule of 4" generally holds.

It's served me well for the past couple of decades, especially when
working with machines that have thousands of CPUs that can turn
even lightly trafficked spinlocks into highly contended locks. That
was the lesson I learnt from this commit:

commit 249a8c1124653fa90f3a3afff869095a31bc229f
Author: David Chinner <dgc@xxxxxxx>
Date:   Tue Feb 5 12:13:32 2008 +1100

    [XFS] Move AIL pushing into it's own thread

    When many hundreds to thousands of threads all try to do
    simultaneous transactions and the log is in a tail-pushing
    situation (i.e. full), we can get multiple threads walking the
    AIL list and contending on the AIL lock.

Get half a dozen simultaneous AIL pushes going, and the AIL
spinlock would break down and burn an entire 2048p machine for half
a day doing what should only take half a second.

Unbound concurrency -always- finds spinlocks to contend on. And if
the machine is large enough, it will then block up the entire
machine as more and more CPUs hit the serialisation point.

As long as I've worked with 500+ cpu machines (since 2002),
scalability has always been about either removing spinlocks from
hot paths or controlling concurrency to a level below where a
spinlock breaks down. You see it again and again in XFS commit logs
where I've either changed something to be lockless or to strictly
constrain concurrency to be less than 4-8p across known hot and/or
contended spinlocks and mutexes.

And I've used it outside XFS, too. It was the basic concept behind
the NUMA-aware shrinker infrastructure and the per-node LRU lists
that it uses. Even the internal spinlocks on those lists start to
break down when bashing on the inode and dentry caches on systems
with per-node CPU counts of 8-16...
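For anyone who hasn't looked at that code, the core of the idea is
nothing more than giving each NUMA node its own list and its own
lock, so insertion and reclaim running on different nodes never
touch the same spinlock or the same cacheline. A rough sketch of
the shape of it (made-up names, not the actual list_lru API, and
init omitted):

	#include <linux/spinlock.h>
	#include <linux/list.h>
	#include <linux/numa.h>
	#include <linux/cache.h>

	/* one lock and one list per NUMA node */
	struct pernode_lru {
		spinlock_t		lock;
		struct list_head	list;
		long			nr_items;
	} ____cacheline_aligned_in_smp;

	struct node_lru {
		struct pernode_lru	node[MAX_NUMNODES];
	};

	/* callers on different nodes hit different spinlocks */
	static void node_lru_add(struct node_lru *lru,
				 struct list_head *item, int nid)
	{
		struct pernode_lru *nlru = &lru->node[nid];

		spin_lock(&nlru->lock);
		list_add_tail(item, &nlru->list);
		nlru->nr_items++;
		spin_unlock(&nlru->lock);
	}

That's the "constrain concurrency" rule expressed as a data
structure: no matter how big the machine is, only the CPUs on one
node can ever pile up on any one of those locks.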
Oh, another "rule of 4" I came across a couple of days ago. My test
machine has 4 nodes, so 4 kswapd threads. One buffered IO reader,
running 100% cpu bound at 2GB/s from a 6GB/s capable block device.
The reader was burning 12% of that CPU on the mapping spinlock
inserting pages into the page cache. The kswapds were each burning
12% of a CPU on the same mapping spinlock reclaiming page cache
pages.

So, overall, the system was burning over 50% of a CPU spinning on
the mapping spinlock and really only doing about half a CPU worth
of real work. Same workload with one kswapd (i.e. single node)?
Contention on the mapping lock is barely measurable.

IOWs, at just 5 concurrent threads doing repeated fast accesses to
the inode mapping spinlock we have reached lock breakdown
conditions. Perhaps we should consider spinlocks harmful these
days...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx