On Thu, Mar 26, 2020 at 09:42:19PM +0000, Trond Myklebust wrote:
> On Thu, 2020-03-26 at 16:40 -0400, bfields@xxxxxxxxxxxx wrote:
> > Maybe the cache_is_expired() logic should be something more like:
> >
> > 	if (h->expiry_time < seconds_since_boot())
> > 		return true;
> > 	if (!test_bit(CACHE_VALID, &h->flags))
> > 		return false;
> > 	return h->expiry_time < seconds_since_boot();
> >
> > So invalid cache entries (which are waiting for a reply from mountd) can
> > expire, but they can't be flushed.  If that makes sense.
> >
> > As a stopgap we may want to revert or drop the "Allow garbage
> > collection" patch, as the (preexisting) memory leak seems lower impact
> > than the server hang.
>
> I believe you were probably seeing the effect of the
> cache_listeners_exist() test, which is just wrong for all cache upcall
> users except idmapper and svcauth_gss. We should not be creating
> negative cache entries just because the rpc.mountd daemon happens to be
> slow to connect to the upcall pipes when starting up, or because it
> crashes and fails to restart correctly.
>
> That's why, when I resubmitted this patch, I included
> https://git.linux-nfs.org/?p=cel/cel-2.6.git;a=commitdiff;h=b840228cd6096bebe16b3e4eb5d93597d0e02c6d
>
> which turns off that particular test for all the upcalls to rpc.mountd.

The hangs persist with that patch, but go away with the change to the
cache_is_expired() logic above.

--b.
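
For reference, here is a rough userspace sketch of the ordering described
above: expiry is checked first, then entries without CACHE_VALID are
exempted from flushing, and only valid entries reach the flush test.
As quoted, the last line of the snippet repeats the expiry comparison,
which can no longer be true at that point, so this sketch assumes a flush
comparison (detail->flush_time >= h->last_refresh) is what was meant
there. The struct layouts, the fake_now clock and the test_bit() helper
are simplified stand-ins for illustration, not the kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-ins for the sunrpc cache structures;
 * only the fields this sketch needs are modelled. */
#define CACHE_VALID 0	/* bit number within flags */

struct cache_head {
	long expiry_time;	/* seconds since boot at which the entry expires */
	long last_refresh;	/* seconds since boot of the last update */
	unsigned long flags;
};

struct cache_detail {
	long flush_time;	/* entries refreshed at or before this get flushed */
};

/* Settable clock so the behaviour can be exercised deterministically. */
static long fake_now;

static long seconds_since_boot(void)
{
	return fake_now;
}

/* Minimal userspace replacement for the kernel's test_bit(). */
static bool test_bit(int nr, const unsigned long *addr)
{
	return (*addr >> nr) & 1UL;
}

static bool cache_is_expired(const struct cache_detail *detail,
			     const struct cache_head *h)
{
	/* Any entry, valid or not, goes away once its expiry time passes. */
	if (h->expiry_time < seconds_since_boot())
		return true;
	/* Entries still waiting on an upcall reply are not flushed. */
	if (!test_bit(CACHE_VALID, &h->flags))
		return false;
	/* Assumed flush test: valid entries refreshed at or before the
	 * last flush are treated as expired. */
	return detail->flush_time >= h->last_refresh;
}
```

With this ordering, an entry that is not yet valid survives a flush and
is only dropped when its own expiry_time passes, while valid entries are
still subject to both checks.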