On Thu, 2020-03-26 at 21:50 -0400, J. Bruce Fields wrote:
> On Thu, Mar 26, 2020 at 09:42:19PM +0000, Trond Myklebust wrote:
> > On Thu, 2020-03-26 at 16:40 -0400, bfields@xxxxxxxxxxxx wrote:
> > > Maybe the cache_is_expired() logic should be something more like:
> > >
> > > 	if (h->expiry_time < seconds_since_boot())
> > > 		return true;
> > > 	if (!test_bit(CACHE_VALID, &h->flags))
> > > 		return false;
> > > 	return h->expiry_time < seconds_since_boot();

Did you mean

	return detail->flush_time >= h->last_refresh;

instead of repeating the h->expiry_time check?

> > >
> > > So invalid cache entries (which are waiting for a reply from
> > > mountd) can expire, but they can't be flushed. If that makes
> > > sense.
> > >
> > > As a stopgap we may want to revert or drop the "Allow garbage
> > > collection" patch, as the (preexisting) memory leak seems lower
> > > impact than the server hang.
> >
> > I believe you were probably seeing the effect of the
> > cache_listeners_exist() test, which is just wrong for all cache
> > upcall users except idmapper and svcauth_gss. We should not be
> > creating negative cache entries just because the rpc.mountd daemon
> > happens to be slow to connect to the upcall pipes when starting up,
> > or because it crashes and fails to restart correctly.
> >
> > That's why, when I resubmitted this patch, I included
> > https://git.linux-nfs.org/?p=cel/cel-2.6.git;a=commitdiff;h=b840228cd6096bebe16b3e4eb5d93597d0e02c6d
> >
> > which turns off that particular test for all the upcalls to
> > rpc.mountd.
>
> The hangs persist with that patch, but go away with the change to the
> cache_is_expired() logic above.

Fair enough. Do you want to send Chuck a fix?

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
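
For reference, the check under discussion can be modelled in user space roughly as below, with the last line using the flush_time comparison suggested above. This is only a sketch, not the kernel source: the *_model struct and function names and CACHE_VALID_BIT are made up for illustration, the structs carry only the fields named in this thread, and the current time is passed in as "now" instead of calling seconds_since_boot().

```c
#include <stdbool.h>

/* Hypothetical bit number standing in for the kernel's CACHE_VALID flag. */
#define CACHE_VALID_BIT 0UL

/* Stripped-down stand-ins for the kernel structures (assumed layout). */
struct cache_detail_model {
	long flush_time;	/* time of the most recent flush request */
};

struct cache_head_model {
	long expiry_time;	/* entry is stale once this time has passed */
	long last_refresh;	/* time the entry was last updated */
	unsigned long flags;	/* bit CACHE_VALID_BIT set once a reply arrived */
};

/* Minimal user-space replacement for test_bit(). */
static inline bool model_test_bit(unsigned long bit, const unsigned long *flags)
{
	return (*flags >> bit) & 1UL;
}

/*
 * An entry past its expiry time is always expired. An entry that is
 * still waiting for a reply (not valid) may expire but is never
 * flushed. A valid entry is also treated as expired when a flush was
 * requested at or after its last refresh.
 */
static inline bool cache_is_expired_model(const struct cache_detail_model *detail,
					  const struct cache_head_model *h,
					  long now)
{
	if (h->expiry_time < now)
		return true;
	if (!model_test_bit(CACHE_VALID_BIT, &h->flags))
		return false;
	return detail->flush_time >= h->last_refresh;
}
```

With this ordering, an entry that is still pending on an upcall survives a flush and is only dropped once its own expiry time has passed, which is the behaviour described in the quoted proposal.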