We recently noticed that, with 5.4+ kernels, the generic/531 test takes a very long time to finish for NFSv4, especially when run on larger systems. Case in point: with a 72 vCPU, 144G EC2 instance as the client, the test takes about 20 hours. So I had a look to see what was going on.

First of all, the test generates a lot of files: it starts 2 * NCPU processes, and each process creates 50000 files. That's 144 processes in this case, 50000 files each. Each process does this by setting its open-file ulimit to 50000 and then opening files, keeping them open, until it hits the limit (a rough sketch of the per-process behavior is appended at the end of this mail). So that's about 7 million newly created, open files - a lot, but the problem can be triggered with far fewer than that as well.

Looking at what the server was doing, I noticed a lot of lock contention on nfsd_file_lru. Then I noticed that nfsd_filecache_count kept going up, tracking the number of files held open by the client processes and eventually reaching that 7 million number.

So here's what happens: for NFSv4, files that are associated with an open stateid can stick around for a long time, as long as no CLOSE is done on them. That's what's happening here. And since those files have a refcount of >= 2 (one for the hash table, one for being pointed to by the state), they are never eligible for removal from the file cache. Worse, since the code calls nfsd_file_gc inline whenever the upper bound (8192) is crossed, every single operation that calls nfsd_file_acquire ends up walking the entire LRU, trying to free files, and failing every time. Walking a list with millions of files on every single operation isn't great.

There are several ways this behavior could be fixed:

* Always allow v4 cached file structures to be purged from the cache. They will stick around, since they still have a reference, but at least they won't slow cache handling to a crawl.
* Don't add v4 files to the cache to begin with.
* Since the only advantage of the file cache for v4 is the caching of files linked to special stateids (as far as I can tell), only cache files associated with special stateids.
* Don't bother with v4 files at all, and revert the changes that made v4 use the file cache.

In general, the resource control for files OPENed by the client is probably an issue. Even if you fix the cache, what if there are N clients that open millions of files and keep them open? Maybe there should be a fallback to start using temporary open files if a client goes beyond a reasonable limit and threatens to eat all resources.

Thoughts?

- Frank
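
P.S. For reference, here is a minimal sketch of what each test process does, as I described above. This is not the actual generic/531 source; the file names and the target-directory argument are made up for illustration. It just caps RLIMIT_NOFILE at 50000, then creates and holds files open until open() fails.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
	const char *dir = argc > 1 ? argv[1] : ".";	/* point at the NFS mount */
	struct rlimit rl;
	char path[4096];
	long n;

	/* Cap this process at 50000 open files, as the test does. */
	if (getrlimit(RLIMIT_NOFILE, &rl) < 0) {
		perror("getrlimit");
		return 1;
	}
	rl.rlim_cur = rl.rlim_max > 50000 ? 50000 : rl.rlim_max;
	if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
		perror("setrlimit");
		return 1;
	}

	/* Create files and keep them open until open() fails (EMFILE). */
	for (n = 0; ; n++) {
		snprintf(path, sizeof(path), "%s/file-%ld", dir, n);
		if (open(path, O_CREAT | O_RDWR, 0644) < 0)
			break;
	}
	printf("holding %ld files open\n", n);

	/*
	 * Keep the opens (and, over NFSv4, the OPEN stateids on the
	 * server) alive until the process is killed. The real test runs
	 * 2 * NCPU of these, which is how the server ends up with
	 * millions of live opens.
	 */
	pause();
	return 0;
}

Running 2 * NCPU instances of this against a v4 mount should reproduce the behavior described above: no CLOSEs are sent until the processes exit, so the server-side file cache just keeps growing.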