Hi, Pierre,

Pierre Labat <plabat@xxxxxxxxxx> writes:

> Hi,
>
> This is FYI, may be you already knows about that, but in case you don't....
>
> I was pushing the limit of the number of nvme read IOPS, the FIO + the
> Linux OS can handle. For that, I have something special under the
> Linux nvme driver. As a consequence I am not limited by whatever the
> NVME SSD max IOPS or IO latency would be.
>
> As I cranked the number of system cores and FIO jobs doing direct 4k
> random read on /dev/nvme0n1, I hit a wall. The IOPS scaling slows
> (less than linear) and around 15 FIO jobs on 15 core threads, the
> overall IOPS, in fact, goes down as I add more FIO jobs. For example
> on a system with 24 cores/48 threads, when I goes beyond 15 FIO jobs,
> the overall IOPS starts to go down.
>
> This happens the same for io_uring and aio. Was using kernel version 6.3.9. Using one namespace (/dev/nvme0n1).

[snip]

> As you can see 76% of the cpu on the box is sucked up by
> lockref_get_not_zero() and lockref_put_return(). Looking at the code,
> there is contention when IO_uring call fsnotify_access().

Is there a FAN_MODIFY fsnotify watch set on /dev? If so, it might be a
good idea to find out what set it and why.

> The filesystem code fsnotify_access() ends up calling dget_parent()
> and later dput() to take a reference on the parent directory (that
> would be /dev/ in our case), and later release the reference. This is
> done (get+put) for each IO.
>
> The dget increments a unique/same counter (for the /dev/ directory)
> and dput decrements this same counter.
>
> As a consequence we have 24 cores/48 threads fighting to get the same
> counter in their cache to modify it. At a rate of M of iops. That is
> disastrous.
>
> To work around that problem, and continue my scalability testing, I
> acked io_uring and aio to set the flag FMODE_NONOTIFY in the struct
> file->f_mode of the file on which IOs are done. Doing that forces
> fsnotify to do nothing. The iops immediately went up more than 4X and
> the fsnotify trashing disappeared.
>
> May be it would be a good idea to add an option to FIO to disable
> fsnotify on the file[s] on which IOs are issued?

Maybe I'm wrong, but that sounds like an abuse of the FMODE_NONOTIFY
flag.

> Or to take a reference on the file parent directory only once when fio
> starts?

Let's decide on whether or not the application is following best
practices, first, starting with answering the questions I asked above.

Cheers,
Jeff
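
One possible way to answer the question about a watch on /dev is to scan
/proc/<pid>/fdinfo/<fd> for inotify and fanotify marks whose inode matches
the directory. The sketch below is only that, a sketch: the helper names
(marks_on_inode, scan) are made up for illustration, the line layout it
parses is assumed to match the "inotify ... ino:<hex> ... mask:<hex>" and
"fanotify ino:<hex> ... mask:<hex>" entries described in the kernel's proc
documentation, and it matches on inode number alone, so a mark on another
filesystem with the same inode number would show up as a false positive.
It needs root to see other users' processes.

```python
import glob
import os
import re

# Assumed layout of a mark line in /proc/<pid>/fdinfo/<fd>:
#   inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 ...
#   fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 ...
# "ino" and "mask" are hexadecimal. \bmask: does not match "ignored_mask:".
MARK_RE = re.compile(
    r"^(inotify|fanotify)\b.*?\bino:([0-9a-f]+)\b.*?\bmask:([0-9a-f]+)",
    re.M,
)

ACCESS_BIT = 0x1  # IN_ACCESS / FAN_ACCESS
MODIFY_BIT = 0x2  # IN_MODIFY / FAN_MODIFY


def marks_on_inode(fdinfo_text, inode):
    """Return (kind, mask, has_access, has_modify) for marks on `inode`."""
    hits = []
    for kind, ino_hex, mask_hex in MARK_RE.findall(fdinfo_text):
        if int(ino_hex, 16) == inode:
            mask = int(mask_hex, 16)
            hits.append(
                (kind, mask, bool(mask & ACCESS_BIT), bool(mask & MODIFY_BIT))
            )
    return hits


def scan(path="/dev"):
    """Print every fdinfo entry that holds a mark on the inode of `path`."""
    inode = os.stat(path).st_ino
    for fdinfo in glob.glob("/proc/[0-9]*/fdinfo/*"):
        try:
            with open(fdinfo) as f:
                text = f.read()
        except OSError:
            # Process exited, fd closed, or permission denied: skip it.
            continue
        for kind, mask, access, modify in marks_on_inode(text, inode):
            print(f"{fdinfo}: {kind} mask={mask:#x} "
                  f"access={access} modify={modify}")


if __name__ == "__main__":
    scan()
```

Each reported path contains the pid holding the watch, so
/proc/<pid>/comm would then say which program set it.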