On Thu, Aug 15, 2024 at 02:25:48PM +0200, Mateusz Guzik wrote:
> I have an ext4-based system where xfs got mounted on tmpfs for testing
> purposes. The directory is being used a lot by gcc when compiling.
>
> I'm testing with 24 compilers running in parallel, each operating on
> their own hello world source file, listed at the end for reference.
>
> Both ext4 and btrfs backing the directory result in 100% cpu
> utilization and about 1500 compiles/second. With xfs I see about 20%
> idle(!) and about 1100 compiles/second.

Yup, you're not using any of the allocation parallelism in XFS by
running all the microbenchmark threads in the same directory. That
serialises the tasks on inode and extent allocation and freeing
because they all hit the same allocation group.

Start by separating threads per directory, because XFS puts
directories in different allocation groups when they are allocated,
and then keeps the contents of the directories local to the AG the
directory is located in. This largely gives perfect scalability
across directories as long as the filesystem has enough AGs in it.

For scalability microbenchmarks, I tend to use an AG count of 2x max
thread count. i.e. for 24 threads, I'd probably use:

# mkfs.xfs -d agcount=48 ....

and put every thread instance in a newly created directory - a rough
sketch of that sort of harness is further down this mail.

For normal workloads (e.g. compiling a large source tree) this
special setup step is not necessary. e.g. the creation of a large
source tree naturally distributes all the directories and files over
all the AGs in the filesystem and so there isn't a single AGI or AGF
buffer lock that serialises the entire concurrent compilation.

You're going to see the same thing with any other will-it-scale
concurrency microbenchmark that has each thread allocate/free inodes
or extents on files in the same directory.

IOWs, this is purely a microbenchmarking setup issue, not a real
world filesystem scalability issue.

> The fact that this is contended aside, I'll note the stock semaphore
> code does not do adaptive spinning, which avoidably significantly
> worsens the impact.

No, we most definitely do not want adaptive spinning. This is a
long-hold, non-owner sleeping lock - it is owned by the buffer, not
the task that locks the buffer. The semaphore protects the contents
of the buffer as IO is performed on it (i.e. while it has no task
associated with it, but hardware is modifying the contents via
asynchronous DMA).

It is also held for long periods of time even when the task that
locked it is on-cpu. Inode and extent allocation/freeing can involve
updating multiple btrees that each contain millions of records, and
all the buffers may be cached and so the task running the allocation
and holding the AGI/AGF locked might actually run for many
milliseconds before it yields the lock.

We absolutely do not want tens of threads optimistically spinning on
these locks when contention occurs - optimistic spinning is extremely
power-inefficient, and these locks are held long enough that you can
measure spinning lock contention events via power socket
monitoring...

> You can probably convert this to a rw semaphore
> and only ever writelock, which should sort out this aspect. I did not
> check what can be done to contend less to begin with.

No. We cannot use any other Linux kernel lock for this, because they
are all mutexes (including rwsems). The optimistic spinning is based
on a task owning the lock and doing the unlock (that's why rwsems
track the write owner task).
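Coming back to the benchmark setup for a moment, here's roughly the
shape of the per-directory harness I'm talking about. It's only a
sketch - the worker count, directory names and the create/unlink
loop are all made up for illustration, not a tuned benchmark:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_WORKERS	24	/* one worker per compiler instance */

static void worker(int id)
{
	char dir[64], path[128];
	int i, fd;

	/*
	 * One directory per worker, so each worker's inode and
	 * extent allocations land in that directory's AG.
	 */
	snprintf(dir, sizeof(dir), "worker-%d", id);
	if (mkdir(dir, 0755) < 0 && errno != EEXIST)
		exit(1);

	for (i = 0; i < 100000; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			exit(1);
		close(fd);
		unlink(path);
	}
	exit(0);
}

int main(void)
{
	int i;

	for (i = 0; i < NR_WORKERS; i++)
		if (fork() == 0)
			worker(i);
	for (i = 0; i < NR_WORKERS; i++)
		wait(NULL);
	return 0;
}

Run something like that with the working directory on a filesystem
made with agcount=48 and the AGF/AGI lock contention should largely
go away.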
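As for the locking side: to make the non-owner semantics concrete,
here's a userspace analogue of the pattern - a POSIX semaphore
standing in for the kernel buffer semaphore, and a thread standing
in for the IO completion handler. Purely illustrative, not the
actual buffer code:

/* build: cc -pthread sem-demo.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

struct buffer {
	sem_t	lock;		/* stands in for the buffer semaphore */
	char	data[64];
};

/*
 * "IO completion": a different context unlocks the buffer. This is
 * legal for a semaphore, but not for a mutex or rwsem - those are
 * expected to be unlocked by the task that locked them.
 */
static void *io_completion(void *arg)
{
	struct buffer *bp = arg;

	sleep(1);		/* pretend DMA is in flight */
	snprintf(bp->data, sizeof(bp->data), "data from disk");
	sem_post(&bp->lock);	/* unlock from a non-owner context */
	return NULL;
}

int main(void)
{
	struct buffer bp;
	pthread_t tid;

	sem_init(&bp.lock, 0, 1);

	/*
	 * Submitter locks the buffer, hands it to the "hardware" and
	 * walks away - no task owns the lock from here on.
	 */
	sem_wait(&bp.lock);
	pthread_create(&tid, NULL, io_completion, &bp);

	/*
	 * Anyone else wanting the buffer now sleeps until the
	 * completion handler unlocks it.
	 */
	sem_wait(&bp.lock);
	printf("%s\n", bp.data);
	sem_post(&bp.lock);

	pthread_join(tid, NULL);
	return 0;
}

A counting semaphore doesn't care which context does the up(), and
that handoff across contexts is exactly the property the buffer lock
relies on.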
We need *pure* sleeping semaphore locks for these buffers and we'd
really, really like for rwsems to be pure semaphores and not a
bastardised rwmutex for the same reasons....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx