On Fri, May 27, 2022 at 06:59:47PM +0000, Chuck Lever III wrote:
> Hi Frank-
>
> Bruce recently reminded me about this issue. Is there a bugzilla somewhere?
> Do you have a reproducer I can try?

Hi Chuck,

The easiest way to reproduce the issue is to run generic/531 over an
NFSv4 mount, using a system with a large number of CPUs on the client
side (or just scaling the test up manually - it has a calculation based
on the number of CPUs).

The test will take a long time to finish. I initially described the
details here:

https://lore.kernel.org/linux-nfs/20200608192122.GA19171@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

Since then, it was also reported here:

https://lore.kernel.org/all/20210531125948.2D37.409509F4@xxxxxxxxxxxx/T/#m8c3e4173696e17a9d5903d2a619550f352314d20

I posted an experimental patch, but it's actually not quite correct
(although I think the idea behind it makes sense):

https://lore.kernel.org/linux-nfs/20201020183718.14618-4-trondmy@xxxxxxxxxx/T/#m869aa427f125afee2af9a89d569c6b98e12e516f

The basic problem from the initial email I sent:

> So here's what happens: for NFSv4, files that are associated with an
> open stateid can stick around for a long time, as long as there's no
> CLOSE done on them. That's what's happening here. Also, since those files
> have a refcount of >= 2 (one for the hash table, one for being pointed to
> by the state), they are never eligible for removal from the file cache.
> Worse, since the code call nfs_file_gc inline if the upper bound is crossed
> (8192), every single operation that calls nfsd_file_acquire will end up
> walking the entire LRU, trying to free files, and failing every time.
> Walking a list with millions of files every single time isn't great.

I guess the basic issues here are:

1) Should these NFSv4 files be in the filecache at all? They are readily
   available from the state, no need for additional caching.

2) In general: can state ever be gracefully retired if the client still
   has an OPEN?

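To make the cost of the inline GC concrete, here is a toy model of the
behavior described in the quoted text. It is only a sketch and not the
nfsd filecache code: ToyFileCache, walked_for and the small LIMIT are
made-up names and values for illustration, and the refcounting is reduced
to "1 = only the cache holds it, 2 = an open stateid also pins it".

```python
from collections import OrderedDict

LIMIT = 8  # hypothetical stand-in for the 8192 upper bound mentioned above


class ToyFileCache:
    """Toy model: an LRU of refcounted entries with an inline GC on acquire."""

    def __init__(self, limit=LIMIT):
        self.limit = limit
        self.lru = OrderedDict()   # key -> refcount, oldest entry first
        self.entries_walked = 0    # total LRU entries visited by gc()

    def gc(self):
        """Walk the whole LRU; free only entries held solely by the cache."""
        freed = 0
        for key in list(self.lru):
            self.entries_walked += 1
            if self.lru[key] == 1:   # only the hash-table reference remains
                del self.lru[key]
                freed += 1
        return freed

    def acquire(self, key, pinned_by_state=False):
        """Look up or insert an entry, then run gc() inline if over the limit."""
        if key not in self.lru:
            # One reference for the hash table, plus one if open state pins it.
            self.lru[key] = 2 if pinned_by_state else 1
        self.lru.move_to_end(key)
        if len(self.lru) > self.limit:
            self.gc()


def walked_for(n_files, pinned):
    """Acquire n_files distinct files; return (entries walked, cache size)."""
    cache = ToyFileCache()
    for i in range(n_files):
        cache.acquire(i, pinned_by_state=pinned)
    return cache.entries_walked, len(cache.lru)


if __name__ == "__main__":
    print("pinned by open state:", walked_for(100, pinned=True))
    print("not pinned:          ", walked_for(100, pinned=False))
```

In this model, once the limit is crossed with every entry pinned, each
acquire walks the full list and frees nothing, so the total work grows
quadratically with the number of open files. With unpinned entries the GC
actually shrinks the list and the walks stay short.
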
- Frank