----- On 19 Oct, 2020, at 17:19, Daire Byrne daire@xxxxxxxx wrote:
> ----- On 16 Sep, 2020, at 17:01, Daire Byrne daire@xxxxxxxx wrote:
>
>> Trond/Bruce,
>>
>> ----- On 15 Sep, 2020, at 20:59, Trond Myklebust trondmy@xxxxxxxxxxxxxxx wrote:
>>
>>> On Tue, 2020-09-15 at 13:21 -0400, J. Bruce Fields wrote:
>>>> On Mon, Sep 07, 2020 at 06:31:00PM +0100, Daire Byrne wrote:
>>>> > 1) The kernel can drop entries out of the NFS client inode cache
>>>> > (under memory cache churn) when those filehandles are still being
>>>> > used by the knfsd's remote clients resulting in sporadic and random
>>>> > stale filehandles. This seems to be mostly for directories from
>>>> > what I've seen. Does the NFS client not know that knfsd is still
>>>> > using those files/dirs? The workaround is to never drop inode &
>>>> > dentry caches on the re-export servers (vfs_cache_pressure=1). This
>>>> > also helps to ensure that we actually make the most of our
>>>> > actimeo=3600,nocto mount options for the full specified time.
>>>>
>>>> I thought reexport worked by embedding the original server's
>>>> filehandles in the filehandles given out by the reexporting server.
>>>>
>>>> So, even if nothing's cached, when the reexporting server gets a
>>>> filehandle, it should be able to extract the original filehandle from
>>>> it and use that.
>>>>
>>>> I wonder why that's not working?
>>>
>>> NFSv3? If so, I suspect it is because we never wrote a lookupp()
>>> callback for it.
>>
>> So in terms of the ESTALE counter on the reexport server, we see it increase if
>> the end client mounts the reexport using either NFSv3 or NFSv4. But there is a
>> difference in the client experience in that with NFSv3 we quickly get
>> input/output errors but with NFSv4 we don't. But it does seem like the
>> performance drops significantly which makes me think that NFSv4 retries the
>> lookups (which succeed) when an ESTALE is reported but NFSv3 does not?
>>
>> This is the simplest reproducer I could come up with but it may still be
>> specific to our workloads/applications and hard to replicate exactly.
>>
>> nfs-client # sudo mount -t nfs -o vers=3,actimeo=5,ro reexport-server:/vol/software /mnt/software
>> nfs-client # while true; do /mnt/software/bin/application; echo 3 | sudo tee /proc/sys/vm/drop_caches; done
>>
>> reexport-server # sysctl -w vm.vfs_cache_pressure=100
>> reexport-server # while true; do echo 3 > /proc/sys/vm/drop_caches ; done
>> reexport-server # while true; do awk '/fh/ {print $2}' /proc/net/rpc/nfsd; sleep 10; done
>>
>> Where "application" is some big application with lots of paths to scan with libs
>> to memory map and "/vol/software" is an NFS mount on the reexport-server from
>> another originating NFS server. I don't know why this application loading
>> workload shows this best, but perhaps the access patterns of memory mapped
>> binaries and libs is particularly susceptible to estale?
>>
>> With vfs_cache_pressure=100, running "echo 3 > /proc/sys/vm/drop_caches"
>> repeatedly on the reexport server drops chunks of the dentry & nfs_inode_cache.
>> The ESTALE count increases and the client running the application reports
>> input/output errors with NFSv3 or the loading slows to a crawl with NFSv4.
>>
>> As soon as we switch to vfs_cache_pressure=0, the repeating drop_caches on the
>> reexport server do not cull the dentry or nfs_inode_cache, the ESTALE counter
>> no longer increases and the client experiences no issues (NFSv3 & NFSv4).
>
> I don't suppose anyone has any more thoughts on this one? This is likely the
> first problem that anyone trying to NFS re-export is going to encounter. If
> they re-export NFSv3 they'll just get lots of ESTALE as the nfs inodes are
> dropped from cache (with the default vfs_cache_pressure=100) and if they
> re-export NFSv4, the lookup performance will drop significantly as an ESTALE
> triggers re-lookups.
>
> For our particular use case, it is actually desirable to have
> vfs_cache_pressure=0 to keep nfs client inodes and dentry caches in memory to
> help with expensive metadata lookups, but it would still be nice to have the
> option of using a less drastic setting (such as vfs_cache_pressure=1) to help
> avoid OOM conditions.

Trond has posted some (v3) patches to emulate lookupp for NFSv3 (a million
thanks!) so I applied them to v5.9.1 and ran some more tests using that on the
re-export server. Again, I just pathologically dropped inode & dentry caches
every second on the re-export server (vfs_cache_pressure=100) while a client
looped through some application loading tests.

Now for every combination of re-export (NFSv3 -> NFSv4.x or NFSv4.x -> NFSv3),
I no longer see any stale file handles (/proc/net/rpc/nfsd) when dropping inode
& dentry caches (yay!).

However, my assumption that some of the input/output errors I was seeing were
related to the estales seems to have been misguided. After running these tests
again without any estales, it now looks like a different issue that is unique
to re-exporting NFSv3 from an NFSv4.0 originating server (either Linux or
Netapp). The lookups are all fine (no estale) but reading some files eventually
gives an input/output error on multiple clients which remain consistent until
the re-export nfs-server is restarted. Again, this only occurs while dropping
inode + dentry caches.
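
For anyone repeating this test, a rough sketch of how the stale file handle
count could be sampled on the re-export server is below. It assumes, as the awk
one-liner in the reproducer does, that the second field of the "fh" line in
/proc/net/rpc/nfsd is the stale count. The function names are made up for
illustration and are not part of any nfs tooling.

```shell
#!/bin/sh
# Print the stale filehandle count from an nfsd stats file.
# Assumption: the line whose first field is "fh" holds the stale count in field 2.
# The path defaults to /proc/net/rpc/nfsd; another file can be passed for testing.
stale_fh_count() {
    awk '$1 == "fh" { print $2 }' "${1:-/proc/net/rpc/nfsd}"
}

# Print how much the count grew between two samples (older first, newer second).
stale_fh_delta() {
    echo $(( $2 - $1 ))
}
```

Taking one sample before and one after a run of cache drops, then passing both
to stale_fh_delta, should print 0 if no new stale file handles were counted in
between.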

So in summary, while continuously dropping inode/dentry caches on the re-export
server:

originating server NFSv4.x -> NFSv4.x re-export server = good (no estale, no input/output errors)
originating server NFSv4.1/4.2 -> NFSv3 re-export server = good
originating server NFSv4.0 -> NFSv3 re-export server = no estale but lots of input/output errors
originating server NFSv3 -> NFSv3 re-export server = good (fixed by Trond's lookupp emulation patches)
originating server NFSv3 -> NFSv4.x re-export server = good (fixed by Trond's lookupp emulation patches)

In our case, we are stuck with some old 7-mode Netapps so we only have two
mount choices, NFSv3 or NFSv4.0 (hence our particular interest in the NFSv4.0
re-export behaviour). And as discussed previously, a re-export of an NFSv3
server requires my horrible hack in order to avoid excessive lookups and client
cache invalidations.

But these lookupp emulation patches fix the ESTALEs for the NFSv3 re-export
cases, so many thanks again for that Trond. When re-exporting an NFSv3 client
mount, we no longer need to set vfs_cache_pressure=0.

Daire

--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cachefs