On Sat, 2019-08-03 at 12:07 -0700, John Hubbard wrote:
> On 8/3/19 7:40 AM, Trond Myklebust wrote:
> > John Hubbard reports seeing the following stack trace:
> > 
> > nfs4_do_reclaim
> >     rcu_read_lock /* we are now in_atomic() and must not sleep */
> >     nfs4_purge_state_owners
> >         nfs4_free_state_owner
> >             nfs4_destroy_seqid_counter
> >                 rpc_destroy_wait_queue
> >                     cancel_delayed_work_sync
> >                         __cancel_work_timer
> >                             __flush_work
> >                                 start_flush_work
> >                                     might_sleep:
> >                                         (kernel/workqueue.c:2975: BUG)
> > 
> > The solution is to separate out the freeing of the state owners
> > from nfs4_purge_state_owners(), and perform that outside the atomic
> > context.
> > 
> 
> All better now--this definitely fixes it. I can reboot the server, and
> of course that backtrace is gone. Then the client mounts hang, so I do
> a mount -a -o remount, and after about 1 minute, the client mounts
> start working again, with no indication of problems. I assume that the
> pause is by design--timing out somewhere, to recover from the server
> going missing for a while. If so, then all is well.
> 

Thanks very much for the report, and for testing!

With regard to the 1 minute delay, I strongly suspect that what you are
seeing is the NFSv4 "grace period". After an NFSv4.x server reboot, the
clients are given a certain amount of time in which to recover the file
open state and lock state that they may have held before the reboot.
All non-recovery opens, locks and all I/O are stopped while this
recovery process is happening, in order to ensure that locking
conflicts do not occur. This ensures that all locks can survive server
reboots without any loss of atomicity.

With NFSv4.1 and NFSv4.2, the server can determine when all the clients
have finished recovering state and end the grace period early; however,
I've recently seen cases where that was not happening. I'm not sure yet
whether that is a real server regression.
Cheers
  Trond
-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx