On Tue, Mar 09, 2021 at 02:35:42PM -0800, Andi Kleen wrote:
> "Darrick J. Wong" <djwong@xxxxxxxxxx> writes:
> > It might be nice to leave that as a breadcrumb, then, in case the
> > spinlock scalability problems ever get solved.
>
> It might be already solved, depending on if Dave's rule of thumb
> was determined before the Linux spinlocks switched to MCS locks or
> not.

It's what I see on my current 2-socket, 32p/64t machine with a
handful of optane DC4800 SSDs attached to it running 5.12-rc1+.

MCS doesn't make spin contention go away, it just stops spinlocks
from bouncing the same cacheline all over the machine. i.e. if
you've got more than a single CPU's worth of critical section to
execute, then spinlocks are going to spin, no matter how they are
implemented. So AFAICT the contention I'm measuring is not
cacheline bouncing, but just the cost of multiple CPUs spinning
while queued waiting for the spinlock...

> In my experience spinlock scalability depends a lot on how long the
> critical section is (that is very important, short sections are a
> lot worse than long sections), as well as if the contention is
> inside a socket or over sockets, and the actual hardware behaves
> differently too.

Yup, and most of the critical sections that the icloglock is used
to protect are quite short.

> So I would be quite surprised if the "rule of 4" generally holds.

It's served me well for the past couple of decades, especially when
working with machines that have thousands of CPUs that can turn
even lightly trafficked spinlocks into highly contended locks. That
was the lesson I learnt from this commit:

commit 249a8c1124653fa90f3a3afff869095a31bc229f
Author: David Chinner <dgc@xxxxxxx>
Date:   Tue Feb 5 12:13:32 2008 +1100

    [XFS] Move AIL pushing into it's own thread

    When many hundreds to thousands of threads all try to do
    simultaneous transactions and the log is in a tail-pushing
    situation (i.e. full), we can get multiple threads walking the
    AIL list and contending on the AIL lock.

Get half a dozen simultaneous AIL pushes going, and the AIL
spinlock would break down and burn an entire 2048p machine for half
a day doing what should only take half a second.

Unbound concurrency -always- finds spinlocks to contend on. And if
the machine is large enough, it will then block up the entire
machine as more and more CPUs hit the serialisation point.

As long as I've worked with 500+ cpu machines (since 2002),
scalability has always been about either removing spinlocks from
hot paths or controlling concurrency to a level below where a
spinlock breaks down. You see it again and again in XFS commit logs
where I've either changed something to be lockless or to strictly
constrain concurrency to be less than 4-8p across known hot and/or
contended spinlocks and mutexes.

And I've used it outside XFS, too. It was the basic concept behind
the NUMA-aware shrinker infrastructure and the per-node LRU lists
that it uses. Even the internal spinlocks on those lists start to
break down when bashing on the inode and dentry caches on systems
with per-node CPU counts of 8-16...
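For anyone who hasn't looked at that code, the core of the idea is
nothing more than giving each NUMA node its own list and its own
lock, so insertion and reclaim running on different nodes never
touch the same spinlock or the same cacheline. A rough sketch of
the shape of it (made-up names, not the actual list_lru API, and
init omitted):

	#include <linux/spinlock.h>
	#include <linux/list.h>
	#include <linux/numa.h>
	#include <linux/cache.h>

	/* one lock and one list per NUMA node */
	struct pernode_lru {
		spinlock_t		lock;
		struct list_head	list;
		long			nr_items;
	} ____cacheline_aligned_in_smp;

	struct node_lru {
		struct pernode_lru	node[MAX_NUMNODES];
	};

	/* callers on different nodes hit different spinlocks */
	static void node_lru_add(struct node_lru *lru,
				 struct list_head *item, int nid)
	{
		struct pernode_lru *nlru = &lru->node[nid];

		spin_lock(&nlru->lock);
		list_add_tail(item, &nlru->list);
		nlru->nr_items++;
		spin_unlock(&nlru->lock);
	}

That's the "constrain concurrency" rule expressed as a data
structure: no matter how big the machine is, only the CPUs on one
node can ever pile up on any one of those locks.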
Oh, another "rule of 4" I came across a couple of days ago. My test
machine has 4 nodes, so 4 kswapd threads. One buffered IO reader,
running 100% cpu bound at 2GB/s from a 6GB/s capable block device.
The reader was burning 12% of that CPU on the mapping spinlock
inserting pages into the page cache. The kswapds were each burning
12% of a CPU on the same mapping spinlock reclaiming page cache
pages.

So, overall, the system was burning over 50% of a CPU spinning on
the mapping spinlock and really only doing about half a CPU worth
of real work. Same workload with one kswapd (i.e. single node)?
Contention on the mapping lock is barely measurable.

IOWs, at just 5 concurrent threads doing repeated fast accesses to
the inode mapping spinlock we have reached lock breakdown
conditions. Perhaps we should consider spinlocks harmful these
days...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx