On Thu, Aug 15, 2024 at 02:25:48PM +0200, Mateusz Guzik wrote:
> I have an ext4-based system where xfs got mounted on tmpfs for testing
> purposes. The directory is being used a lot by gcc when compiling.
>
> I'm testing with 24 compilers running in parallel, each operating on
> their own hello world source file, listed at the end for reference.
>
> Both ext4 and btrfs backing the directory result in 100% cpu
> utilization and about 1500 compiles/second. With xfs I see about 20%
> idle(!) and about 1100 compiles/second.

Yup, you're not using any of the allocation parallelism in XFS by
running all the microbenchmark threads in the same directory. That
serialises the tasks on inode and extent allocation and freeing
because they all hit the same allocation group.

Start by separating threads per directory, because XFS puts
directories in different allocation groups when they are allocated,
and then keeps the contents of the directories local to the AG the
directory is located in. This largely gives perfect scalability
across directories as long as the filesystem has enough AGs in it.

For scalability microbenchmarks, I tend to use an AG count of 2x max
thread count. i.e. for 24 threads, I'd probably use:

# mkfs.xfs -d agcount=48 ....

and put every thread instance in a newly created directory - a rough
sketch of that sort of harness is further down this mail.

For normal workloads (e.g. compiling a large source tree) this
special setup step is not necessary. e.g. the creation of a large
source tree naturally distributes all the directories and files over
all the AGs in the filesystem and so there isn't a single AGI or AGF
buffer lock that serialises the entire concurrent compilation.

You're going to see the same thing with any other will-it-scale
concurrency microbenchmark that has each thread allocate/free inodes
or extents on files in the same directory.

IOWs, this is purely a microbenchmarking setup issue, not a real
world filesystem scalability issue.

> The fact that this is contended aside, I'll note the stock semaphore
> code does not do adaptive spinning, which avoidably significantly
> worsens the impact.

No, we most definitely do not want adaptive spinning. This is a
long-hold, non-owner sleeping lock - it is owned by the buffer, not
the task that locks the buffer. The semaphore protects the contents
of the buffer as IO is performed on it (i.e. while it has no task
associated with it, but hardware is modifying the contents via
asynchronous DMA).

It is also held for long periods of time even when the task that
locked it is on-cpu. Inode and extent allocation/freeing can involve
updating multiple btrees that each contain millions of records, and
all the buffers may be cached and so the task running the allocation
and holding the AGI/AGF locked might actually run for many
milliseconds before it yields the lock.

We absolutely do not want tens of threads optimistically spinning on
these locks when contention occurs - optimistic spinning is extremely
power-inefficient, and these locks are held long enough that you can
measure spinning lock contention events via power socket
monitoring...

> You can probably convert this to a rw semaphore
> and only ever writelock, which should sort out this aspect. I did not
> check what can be done to contend less to begin with.

No. We cannot use any other Linux kernel lock for this, because they
are all mutexes (including rwsems). The optimistic spinning is based
on a task owning the lock and doing the unlock (that's why rwsems
track the write owner task).
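Coming back to the benchmark setup for a moment, here's roughly the
shape of the per-directory harness I'm talking about. It's only a
sketch - the worker count, directory names and the create/unlink
loop are all made up for illustration, not a tuned benchmark:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_WORKERS	24	/* one worker per compiler instance */

static void worker(int id)
{
	char dir[64], path[128];
	int i, fd;

	/*
	 * One directory per worker, so each worker's inode and
	 * extent allocations land in that directory's AG.
	 */
	snprintf(dir, sizeof(dir), "worker-%d", id);
	if (mkdir(dir, 0755) < 0 && errno != EEXIST)
		exit(1);

	for (i = 0; i < 100000; i++) {
		snprintf(path, sizeof(path), "%s/file-%d", dir, i);
		fd = open(path, O_CREAT | O_WRONLY, 0644);
		if (fd < 0)
			exit(1);
		close(fd);
		unlink(path);
	}
	exit(0);
}

int main(void)
{
	int i;

	for (i = 0; i < NR_WORKERS; i++)
		if (fork() == 0)
			worker(i);
	for (i = 0; i < NR_WORKERS; i++)
		wait(NULL);
	return 0;
}

Run something like that with the working directory on a filesystem
made with agcount=48 and the AGF/AGI lock contention should largely
go away.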
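As for the locking side: to make the non-owner semantics concrete,
here's a userspace analogue of the pattern - a POSIX semaphore
standing in for the kernel buffer semaphore, and a thread standing
in for the IO completion handler. Purely illustrative, not the
actual buffer code:

/* build: cc -pthread sem-demo.c */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <unistd.h>

struct buffer {
	sem_t	lock;		/* stands in for the buffer semaphore */
	char	data[64];
};

/*
 * "IO completion": a different context unlocks the buffer. This is
 * legal for a semaphore, but not for a mutex or rwsem - those are
 * expected to be unlocked by the task that locked them.
 */
static void *io_completion(void *arg)
{
	struct buffer *bp = arg;

	sleep(1);		/* pretend DMA is in flight */
	snprintf(bp->data, sizeof(bp->data), "data from disk");
	sem_post(&bp->lock);	/* unlock from a non-owner context */
	return NULL;
}

int main(void)
{
	struct buffer bp;
	pthread_t tid;

	sem_init(&bp.lock, 0, 1);

	/*
	 * Submitter locks the buffer, hands it to the "hardware" and
	 * walks away - no task owns the lock from here on.
	 */
	sem_wait(&bp.lock);
	pthread_create(&tid, NULL, io_completion, &bp);

	/*
	 * Anyone else wanting the buffer now sleeps until the
	 * completion handler unlocks it.
	 */
	sem_wait(&bp.lock);
	printf("%s\n", bp.data);
	sem_post(&bp.lock);

	pthread_join(tid, NULL);
	return 0;
}

A counting semaphore doesn't care which context does the up(), and
that handoff across contexts is exactly the property the buffer lock
relies on.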
We need *pure* sleeping semaphore locks for these buffers and we'd
really, really like for rwsems to be pure semaphores and not a
bastardised rwmutex for the same reasons....

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx