Re: FYI, fsnotify contention with aio and io_uring.

On 8/7/23 2:11 PM, Jeff Moyer wrote:
> Hi, Pierre,
> 
> Pierre Labat <plabat@xxxxxxxxxx> writes:
> 
>> Hi,
>>
>> This is FYI; maybe you already know about this, but in case you don't....
>>
>> I was pushing the limit of the number of NVMe read IOPS that FIO +
>> the Linux OS can handle. For that, I have something special under the
>> Linux nvme driver, so I am not limited by whatever the NVMe SSD's max
>> IOPS or IO latency would be.
>>
>> As I cranked up the number of system cores and FIO jobs doing direct
>> 4k random reads on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
>> (becomes less than linear), and at around 15 FIO jobs on 15 core
>> threads the overall IOPS in fact goes down as I add more jobs. For
>> example, on a system with 24 cores/48 threads, when I go beyond 15
>> FIO jobs, the overall IOPS starts to drop.
>>
>> This happens the same way for both io_uring and aio. I was using
>> kernel version 6.3.9 and a single namespace (/dev/nvme0n1).
> 
> [snip]
> 
>> As you can see, 76% of the CPU on the box is sucked up by
>> lockref_get_not_zero() and lockref_put_return(). Looking at the code,
>> there is contention when io_uring calls fsnotify_access().
> 
> Is there a FAN_MODIFY fsnotify watch set on /dev?  If so, it might be a
> good idea to find out what set it and why.

This would be my guess too; some distros do seem to do that. The
notification bits scale horribly, and nobody should use them for
anything high performance... Some sketches for reproducing and
checking this follow below.
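
For reference, a minimal sketch of the workload class Pierre describes:
direct 4k random reads on /dev/nvme0n1 through io_uring, here as a
single liburing submit/wait loop (fio fans this out across jobs; the
queue depth, iteration count, and 1GB offset range are illustrative
assumptions, not values from the thread):

/* Sketch: direct 4k random reads on /dev/nvme0n1 via io_uring.
 * Build with: gcc -o randread randread.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BS 4096
#define QD 32

int main(void)
{
	struct io_uring ring;
	void *buf;
	int fd, i;

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, BS, BS))
		return 1;
	if (io_uring_queue_init(QD, &ring, 0) < 0)
		return 1;

	for (i = 0; i < 1000000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;
		/* random 4k-aligned offset in the first 1GB; size this to
		 * the device in a real run */
		off_t off = (random() % (1UL << 18)) * BS;

		io_uring_prep_read(sqe, fd, buf, BS, off);
		io_uring_submit(&ring);
		if (io_uring_wait_cqe(&ring, &cqe) < 0)
			break;
		if (cqe->res < 0)
			fprintf(stderr, "read: %d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
		/* per the thread, each completed read goes through
		 * fsnotify_access(); with a watch on /dev, that is where
		 * the lockref contention shows up */
	}
	io_uring_queue_exit(&ring);
	close(fd);
	return 0;
}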
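
One way to answer Jeff's question (what, if anything, set a watch on
/dev) is to scan procfs: fanotify and inotify marks show up as
"fanotify ..." / "inotify ..." lines in /proc/<pid>/fdinfo/<fd>. A
rough scanner along those lines, assuming root and with error handling
trimmed:

/* Sketch: report which processes hold fanotify/inotify marks. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *p;

	if (!proc)
		return 1;
	while ((p = readdir(proc)) != NULL) {
		char path[320], fpath[704], line[512];
		DIR *fdinfo;
		struct dirent *f;

		if (!isdigit((unsigned char)p->d_name[0]))
			continue;	/* not a PID directory */
		snprintf(path, sizeof(path), "/proc/%s/fdinfo", p->d_name);
		fdinfo = opendir(path);
		if (!fdinfo)
			continue;	/* process gone, or no permission */
		while ((f = readdir(fdinfo)) != NULL) {
			FILE *fp;

			if (f->d_name[0] == '.')
				continue;
			snprintf(fpath, sizeof(fpath), "%s/%s", path, f->d_name);
			fp = fopen(fpath, "r");
			if (!fp)
				continue;
			/* mark entries begin with "fanotify"/"inotify" */
			while (fgets(line, sizeof(line), fp)) {
				if (!strncmp(line, "fanotify", 8) ||
				    !strncmp(line, "inotify", 7))
					printf("pid %s fd %s: %s",
					       p->d_name, f->d_name, line);
			}
			fclose(fp);
		}
		closedir(fdinfo);
	}
	closedir(proc);
	return 0;
}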
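
And to reproduce the effect: a minimal sketch, assuming a fanotify
mount mark on /dev is the sort of thing the distro tooling left behind.
One FAN_ACCESS|FAN_MODIFY mark on the mount is enough to put the
fsnotify hooks into the completion path of every read on /dev/nvme0n1
(requires CAP_SYS_ADMIN):

/* Sketch: one fanotify mount mark on /dev. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>

int main(void)
{
	int fd = fanotify_init(FAN_CLASS_NOTIF, O_RDONLY);

	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}
	/* One mark on the mount: every file under /dev now has a watched
	 * mount, so the per-IO fsnotify hooks no longer short-circuit. */
	if (fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
			  FAN_ACCESS | FAN_MODIFY, AT_FDCWD, "/dev") < 0) {
		perror("fanotify_mark");
		return 1;
	}
	/* ... a real listener would read events from fd here; for the
	 * contention above, the mark's existence is what matters. */
	pause();
	return 0;
}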

-- 
Jens Axboe



