Hi Dave, thanks for addressing this issue! Readdir on a large directory that's constantly changing is a pain in the b... We solved it by creating a snapshot of the directory on the server and identifying that snapshot by the cookie verifier. This is probably not POSIX compliant, but it works for us. I am happy to run some tests.

Tigran.

----- Original Message -----
> From: "David Wysochanski" <dwysocha@xxxxxxxxxx>
> To: "trondmy" <trondmy@xxxxxxxxxxxxxxx>, "Anna Schumaker" <anna.schumaker@xxxxxxxxxx>
> Cc: "linux-nfs" <linux-nfs@xxxxxxxxxxxxxxx>
> Sent: Monday, 2 November, 2020 14:50:00
> Subject: [PATCH 00/11] Add NFS readdir tracepoints and improve performance of reading directories
>
> Note these patches are also on github at:
> https://github.com/DaveWysochanskiRH/kernel/tree/5.9-nfs-readdir
>
> This patchset is mostly tracepoints, but patch 9 fixes a performance
> problem with NFSv4.x when listing a directory that is being modified.
> After the NFSv4.x patch, for consistency, patch 11 is included, which
> makes NFSv3 behave the same as NFSv4 and, given the inclusion of
> patch 9, does not re-introduce the regression fixed by commit
> 79f687a3de9e. As described in patch 11, NFSv3 does not currently have
> the NFSv4 problem because it does not drop the pagecache when a
> large directory read is in progress.
>
> As background, currently any process that is listing a directory
> must search starting at cookie 0 on each entry to nfs_readdir(),
> regardless of the previous cookie returned by the server, regardless of
> NFS version, and regardless of whether the directory is being modified
> or the cache has expired. This means, for example, that if a large
> directory listing requires many getdents64/nfs_readdir calls, the
> kernel will search through the pagecache with an increasing penalty as
> the size of the directory increases and the process attempts to obtain
> the latter entries of the directory. When the directory is being
> modified, acdirmax is exceeded, and nfs_attribute_cache_expired()
> returns true, NFSv4.x drops the pagecache for the directory entries,
> so the process cannot make forward progress.
>
> I investigated an approach similar to the one taken for NFSv3 in
> commit 79f687a3de9e, but for NFSv4, READDIR does not return the directory
> attributes, and thus _nfs4_proc_readdir does not call nfs_refresh_inode(),
> unlike nfs3_proc_readdir(). With NFSv3, the call to nfs_refresh_inode()
> is what updates nfsi->read_cache_jiffies, causing nfs_attribute_cache_expired()
> to always return false and leaving the pagecache intact despite
> NFS_INO_INVALID_DATA being set. Thus it's not clear whether the NFSv3
> approach could be applied to NFSv4 to achieve the same behavior as with
> NFSv3. In addition, the current NFSv3 approach leaves in place the
> aforementioned penalty, which for large directories can be substantial.
> So rather than this approach, the one taken with NFSv4 leaves
> in place the dropping of the pagecache when acdirmax expires, and
> relies on the fact that a process can trust the last cookie returned
> by the server and continue at that point in the pagecache without
> forcing the current process to start over from cookie 0 upon the
> next entry to nfs_readdir(). As far as I could tell, there is no
> POSIX requirement to start from 0 again when a directory is being
> modified during an in-progress directory read, so the current
> behavior is only an implementation limitation.
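> To make the growth of that penalty concrete, here is a toy userspace
> model (not kernel code, and not part of the patchset). It assumes
> ~102 entries per READDIR call, which follows from the reproducer
> below (4902 READDIR ops for 500000 files), plus an assumed ~25
> directory entries per pagecache page, and compares pages scanned
> when every call restarts its search at cookie 0 versus when it
> resumes at the last cookie as patch 9 allows:
>
>   #include <stdio.h>
>
>   int main(void)
>   {
>       /* Illustrative numbers: ~102 entries per READDIR call (4902
>        * ops for 500000 files in the reproducer) and an assumed ~25
>        * entries per pagecache page. */
>       const long entries = 500000, per_call = 102, per_page = 25;
>       const long calls = (entries + per_call - 1) / per_call;
>       long restart_pages = 0, resume_pages = 0;
>
>       for (long c = 0; c < calls; c++) {
>           long pos = c * per_call;              /* entry we must reach */
>           restart_pages += pos / per_page + 1;  /* rescan from page 0 */
>           resume_pages += per_call / per_page + 1;  /* continue in place */
>       }
>       printf("calls=%ld pages scanned: restart-from-0=%ld resume=%ld\n",
>              calls, restart_pages, resume_pages);
>       return 0;
>   }
>
> With these numbers, restarting from cookie 0 scans roughly 49 million
> pages over the full listing, while resuming scans about 25 thousand,
> which is why the penalty grows so quickly with directory size.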
>
> The NFSv4 performance problem can be seen by exporting a directory
> with a large enough number of files that the uncached time to list
> the directory exceeds acdirmax. I have an explicit reproducer script
> I can provide, but a simple reproducer outline is as follows:
>
> 1. First export a directory that contains enough files that the
> listing exceeds acdirmax. In my testing, the NFS server and client
> were the same system, there were 500,000 files in the directory,
> and I set acdirmin=10,acdirmax=20.
>
> 2. Mount client1 (reader) and client2 (writer). Note below I use
> the loopback IP address, and a 'wsize' parameter on the writer mount
> so we get different superblocks:
> mount -o vers=4.1 127.0.0.1:/export/bz1889478 $MNT1
> mount -o vers=4.1,wsize=65536 127.0.0.1:/export/bz1889478 $MNT2
>
> 3. Start client2 (writer):
> echo 3 > /proc/sys/vm/drop_caches
> i=500000; while true; do touch $MNT2/file$i.bin; i=$((i + 1)); sleep 1; done > /dev/null 2>&1 &
>
> 4. Start client1 (reader):
> while true; do time ls -1f $MNT1 | wc -l; done &
>
> The output from my reproducer script is:
> ./t0_bz1889478.sh 4.1 127.0.0.1 500000
> Mon 02 Nov 2020 07:54:22 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and files 500000 directory is idle
> Mon 02 Nov 2020 07:54:22 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:22 AM EST: dropping caches with: echo 3 > /proc/sys/vm/drop_caches
> Mon 02 Nov 2020 07:54:24 AM EST: idle directory: skipping client2 adding files
> Mon 02 Nov 2020 07:54:24 AM EST: client1 running command: ls -lf /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:24 AM EST: waiting for listing to complete
> Mon 02 Nov 2020 07:54:46 AM EST: client1 pid took 22 seconds to list the directory of 500000 files
> Mon 02 Nov 2020 07:54:46 AM EST: client1 READDIR ops total: 4902 (before = 0 after = 4902)
> Mon 02 Nov 2020 07:54:47 AM EST: umounting /mnt/nfs/bz1889478-mnt1 /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:47 AM EST: Running with NFS vers=4.1 server 127.0.0.1 and files 500000 directory being modified
> Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:47 AM EST: mount -o vers=4.1,wsize=65536 127.0.0.1:/export/bz1889478 /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:47 AM EST: dropping caches with: echo 3 > /proc/sys/vm/drop_caches
> Mon 02 Nov 2020 07:54:50 AM EST: changing directory: client2 start adding 1 file/sec at /mnt/nfs/bz1889478-mnt2
> Mon 02 Nov 2020 07:54:51 AM EST: client1 running command: ls -lf /mnt/nfs/bz1889478-mnt1
> Mon 02 Nov 2020 07:54:51 AM EST: waiting for listing to complete
> Mon 02 Nov 2020 07:55:12 AM EST: client1 pid took 21 seconds to list the directory of 500000 files
> Mon 02 Nov 2020 07:55:12 AM EST: client1 READDIR ops total: 4902 (before = 0 after = 4902)
> ./t0_bz1889478.sh: line 124: 5973 Killed while true; do echo $i; touch $NFS_CLIENT_MOUNTPOINT2/file$i.bin; i=$((i + 1)); sleep 1; done > /dev/null 2>&1
> Mon 02 Nov 2020 07:55:20 AM EST: umounting /mnt/nfs/bz1889478-mnt1 /mnt/nfs/bz1889478-mnt2
> PASSED TEST ./t0_bz1889478.sh on kernel 5.9.0-nfs-readdir+ with NFS vers=4.1
>
> For diagnostics and verification, of course a tcpdump can be used,
> or READDIR ops and time can be compared as in the reproducer (a
> small timing helper is sketched below), but the included tracepoints
> can also be used.
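> As a convenience for the ops-and-time comparison, here is a minimal
> C sketch of the reader side, equivalent to the "time ls -1f | wc -l"
> loop in step 4; it is illustrative only and not part of the patchset
> or the reproducer script:
>
>   #include <dirent.h>
>   #include <stdio.h>
>   #include <time.h>
>
>   int main(int argc, char **argv)
>   {
>       const char *path = argc > 1 ? argv[1] : ".";
>       struct timespec t0, t1;
>       long count = 0;
>       DIR *dir = opendir(path);
>
>       if (!dir) { perror("opendir"); return 1; }
>       clock_gettime(CLOCK_MONOTONIC, &t0);
>       while (readdir(dir))  /* each call may trigger a READDIR RPC */
>           count++;
>       clock_gettime(CLOCK_MONOTONIC, &t1);
>       closedir(dir);
>       printf("%ld entries in %.1f seconds\n", count,
>              (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
>       return 0;
>   }
>
> Run it against $MNT1 while client2 is adding files; before patch 9,
> the modified-directory run would be expected to take far longer than
> the idle run, or to never complete.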
> For the tracepoints, before step #2 above use the below trace-cmd
> to trace the listing and see whether the problem occurs or not,
> evidenced either by the presence of nfs_invalidate_mapping* trace
> events or by multiple nfs_readdir_enter calls with "cookie=0x00000000":
> trace-cmd start -e nfs:nfs_readdir_enter -e nfs4:nfs4_readdir -e nfs:nfs_readdir_exit -e nfs:nfs_invalidate_mapping_*
>
>
> Dave Wysochanski (11):
>   NFSv4: Improve nfs4_readdir tracepoint by adding additional fields
>   NFS: Replace dfprintk statements with trace events in nfs_readdir
>   NFS: Move nfs_readdir_descriptor_t into internal header
>   NFS: Add tracepoints for functions involving nfs_readdir_descriptor_t
>   NFS: Add tracepoints for opendir, closedir, fsync_dir and llseek_dir
>   NFS: Add tracepoints for nfs_readdir_xdr_filler enter and exit
>   NFS: Add tracepoint to entry and exit of nfs_do_filldir
>   NFS: Replace LOOKUPCACHE dfprintk statements with tracepoints
>   NFS: Improve performance of listing directories being modified
>   NFS: Add page_index to nfs_readdir enter and exit tracepoints
>   NFS: Bring back nfs_dir_mapping_need_revalidate() in nfs_readdir()
>
>  fs/nfs/dir.c           | 101 +++++++++--------
>  fs/nfs/internal.h      |  18 ++++
>  fs/nfs/nfs3xdr.c       |   2 +-
>  fs/nfs/nfs4proc.c      |   2 +-
>  fs/nfs/nfs4trace.h     |  44 +++++++-
>  fs/nfs/nfstrace.h      | 277 +++++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/nfs_fs.h |   1 +
>  7 files changed, 394 insertions(+), 51 deletions(-)
>
> --
> 1.8.3.1