XFS/ext4/others all need to lock the inode for buffered writes. Since io_uring handles any IO in an async manner, this means that for higher queue depth buffered write workloads, we end up with a lot of workers hammering on the same mutex.

Running a QD=32 random write workload on my test box yields about 200K 4k random write IOPS with io_uring. Looking at system profiles, we're spending about half the time contending on the inode mutex. Oof.

For buffered writes, we don't necessarily need a huge number of threads issuing that IO. If we instead rely on normal flushing to take care of getting the parallelism we need on the device side, we can limit ourselves to a much lower depth. This still gets us async behavior on the submission side.

With this small series, my 200K IOPS goes to 370K IOPS for the same workload.

This issue came out of postgres implementing io_uring support and reporting some of the issues they saw.

--
Jens Axboe
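
[Editor's note: not part of the series. For readers unfamiliar with the workload being described, below is a minimal, illustrative liburing sketch of a QD=32 buffered 4k random-write loop, roughly the shape of workload referenced above. The file name, file size, and write count are made up for the example; error handling is minimal.]

	/*
	 * Illustrative only: keep up to QD buffered 4k random writes in
	 * flight with io_uring via liburing.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <liburing.h>

	#define QD		32
	#define BS		4096
	#define FILE_SIZE	(1ULL << 30)	/* 1G target range, arbitrary */
	#define NR_WRITES	(256 * 1024)

	int main(void)
	{
		struct io_uring ring;
		struct io_uring_cqe *cqe;
		unsigned long inflight = 0, issued = 0, done = 0;
		void *buf;
		int fd, ret;

		/* buffered IO: no O_DIRECT */
		fd = open("testfile", O_RDWR | O_CREAT, 0644);
		if (fd < 0) {
			perror("open");
			return 1;
		}
		if (posix_memalign(&buf, BS, BS))
			return 1;
		memset(buf, 0x5a, BS);

		ret = io_uring_queue_init(QD, &ring, 0);
		if (ret < 0) {
			fprintf(stderr, "queue_init: %d\n", ret);
			return 1;
		}

		while (done < NR_WRITES) {
			/* top up to QD writes in flight */
			while (inflight < QD && issued < NR_WRITES) {
				struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
				off_t off = (random() % (FILE_SIZE / BS)) * BS;

				if (!sqe)
					break;
				io_uring_prep_write(sqe, fd, buf, BS, off);
				inflight++;
				issued++;
			}
			io_uring_submit(&ring);

			/* reap one completion before topping up again */
			ret = io_uring_wait_cqe(&ring, &cqe);
			if (ret < 0)
				break;
			io_uring_cqe_seen(&ring, cqe);
			inflight--;
			done++;
		}

		io_uring_queue_exit(&ring);
		close(fd);
		free(buf);
		return 0;
	}

With the series described above, the submission side still looks like this to the application; the change is that the kernel no longer needs a worker per in-flight buffered write all contending on the inode lock.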