On Sat, Sep 26, 2020 at 10:00:22AM +0100, Daire Byrne wrote:
>
>
> ----- On 23 Sep, 2020, at 22:01, Frank van der Linden fllinden@xxxxxxxxxx wrote:
> > It's entirely possible that my patch introduces a refcounting error - it was
> > intended as a proof-of-concept on how to fix the LRU locking issue for v4
> > open file caching (while keeping it enabled) - which is why I didn't
> > "formally" send it in.
> >
> > Having said that, I don't immediately see the problem.
> >
> > Maybe try it without the rhashtable patch, that is much less of an
> > optimization.
> >
> > The problem would have to be nf_ref as part of nfsd_file, or fi_ref as part
> > of nfs4_file. If it's the latter, it's probably the rhashtable change.
>
> Thanks Frank; I think you are right in that it seems to be a problem with the rhashtable patch. Another 48 hours using the same workload with just the main patch and I have not seen the same issue again so far.
>
> Also, it still has the effect of reducing the CPU usage dramatically such that there are plenty of cores still left idle. This is actually helping us buy some more time while we fix our obviously broken software so that it doesn't open/close so crazily.
>
> So, many thanks for that.

Cool. I'm glad the "don't put v4 files on the LRU list" works as intended
for you. The rhashtable patch was more of an afterthought, and obviously
has an issue. It did provide some extra gains, so I'll see if I can find
the problem if I get some time.

Bruce - if you want me to 'formally' submit a version of the patch, let
me know. Just disabling the cache for v4, which comes down to reverting
a few commits, is probably simpler - I'd be able to test that too.

- Frank