FYI, fsnotify contention with aio and io_uring.

Pierre Labat <plabat@xxxxxxxxxx> · Fri, 4 Aug 2023 17:47:25 +0000

Hi,

This is an FYI. Maybe you already know about this, but in case you don't...

I was pushing the limit of the number of NVMe read IOPS that FIO plus the Linux OS can handle. For that, I have something special under the Linux nvme driver, so I am not limited by the NVMe SSD's maximum IOPS or IO latency.

As I cranked up the number of system cores and FIO jobs doing direct 4k random reads on /dev/nvme0n1, I hit a wall. The IOPS scaling slows (less than linear), and at around 15 FIO jobs on 15 core threads the overall IOPS actually goes down as I add more FIO jobs. For example, on a system with 24 cores/48 threads, when I go beyond 15 FIO jobs, the overall IOPS starts to drop.

This happens with both io_uring and aio. I was using kernel version 6.3.9 and one namespace (/dev/nvme0n1).

I did some profiling to find out why. On a 24 cores/48 threads system running 48 FIO jobs, I got the following for the io_uring case:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles'
# Event count (approx.): 1858618550304
#
# Overhead  Command          Shared Object                 Symbol                                     
# ........  ...............  ............................  ...........................................
#
    39.46%  fio              [kernel.vmlinux]              [k] lockref_get_not_zero
            |
            ---lockref_get_not_zero
               dget_parent
               __fsnotify_parent
               io_read
               io_issue_sqe
               io_submit_sqes
               __do_sys_io_uring_enter
               do_syscall_64
               entry_SYSCALL_64
               syscall
.
.
.
    36.03%  fio              [kernel.vmlinux]              [k] lockref_put_return
            |
            ---lockref_put_return
               dput
               __fsnotify_parent
               io_read
               io_issue_sqe
               io_submit_sqes
               __do_sys_io_uring_enter
               do_syscall_64
               entry_SYSCALL_64
               syscall
.
.

As you can see, roughly 75% (39.46% + 36.03%) of the CPU on the box is sucked up by lockref_get_not_zero() and lockref_put_return().
Looking at the code, the contention arises when io_uring calls fsnotify_access().
fsnotify_access() ends up calling dget_parent() to take a reference on the parent directory (that would be /dev/ in our case), and later dput() to release that reference.
This get+put pair is done for each IO.

The dget increments a single counter (the one for the /dev/ directory) and the dput decrements that same counter.

As a consequence, we have 24 cores/48 threads fighting to get the same counter into their cache in order to modify it, at a rate of millions of IOPS. That is disastrous.

To work around that problem and continue my scalability testing, I hacked io_uring and aio to set the FMODE_NONOTIFY flag in the f_mode field of the struct file on which the IOs are done.
Doing that forces fsnotify to do nothing. The IOPS immediately went up by more than 4X and the fsnotify thrashing disappeared.

Maybe it would be a good idea to add an option to FIO to disable fsnotify on the file[s] on which IOs are issued?
Or to take a reference on the file's parent directory only once, when fio starts?

Regards,

Pierre