On Mon, 2012-11-12 at 16:02 -0500, Bryan Schumaker wrote: > On 11/12/2012 03:54 PM, Myklebust, Trond wrote: > > On Mon, 2012-11-12 at 15:49 -0500, Andy Adamson wrote: > >> On Mon, Nov 12, 2012 at 3:29 PM, Myklebust, Trond > >> <Trond.Myklebust@xxxxxxxxxx> wrote: > >>> On Mon, 2012-11-12 at 15:22 -0500, Andy Adamson wrote: > >>>> On Mon, Nov 12, 2012 at 1:35 PM, <bjschuma@xxxxxxxxxx> wrote: > >>>>> From: Bryan Schumaker <bjschuma@xxxxxxxxxx> > >>>>> > >>>>> During recovery the NFS4_SESSION_DRAINING flag might be set on the > >>>>> client structure. This can cause lease renewal to abort when > >>> > >>> Not lease renewal. It is lease verification (i.e. checking that we have > >>> a valid lease and session) that will deadlock. > >>> > >>>>> nfs41_setup_sequence sees that we are doing recovery. As a result, the > >>>>> client never recovers and all activity with the NFS server halts. > >>>> > >>>> > >>>> When does this happen? Say the session is draining, and an RPC returns > >>>> from one of the compounds such as nfs_open, nfs_locku etc whose > >>>> rpc_call_done routine calls renew_lease after freeing it's slot. Like > >>>> all calls on a draining session, the renew_lease sleeps on the session > >>>> slot_tbl_waitq - is this what you mean by "causes lease renewal to > >>>> abort"? How does this cause the client to never recover? > >>>> > >>>> The only other call to renew_lease is from the state manager itself > >>>> which runs one function at a time, and will not call renew_lease until > >>>> the recovery is over.... > >>>> > >>>> What am I missing....? > >>> > >>> nfs4_check_lease() is run exclusively by the state manager thread in > >>> order to check that the clientid (and session id on NFSv4.1) are valid. > >>> It will deadlock the state manager thread if NFS4_SESSION_DRAINING is > >>> already set. > >> > >> OK. NFS4_SESSION_DRAINING is also set exclusively by the state manager > >> thread. Why is the state manager told to check the lease when it's > >> also draining a session? > > > > NFS4_SESSION_DRAINING does not just mean that the session is being > > drained; it remains set until open and lock state recovery is completed. > > > > IOW: if something happens during state recovery that requires us to > > check the lease (e.g. something returns NFS4ERR_EXPIRED) then the > > current code will deadlock. > > > >> No matter what the answer, please update the patch description to > >> better explain the problem being solved. > > > > ACK. I agree that the changelog entry can be improved. > > > > Does this read any better (and should I resend the patch)? > > NFS: Add sequence_priviliged_ops for nfs4_proc_sequence() > > If I mount an NFS v4.1 server to a single client multiple times and then > run xfstests over each mountpoint I usually get the client into a state > where recovery deadlocks. The server informs the client of a > cb_path_down sequence error, the client then does a > bind_connection_to_session and checks the status of the lease. Why are you getting the cb_path_down? Is that a result of a fault injection, or is it a genuine error? While cb_path_down is a valid error, and we do want the recovery to work correctly, I would expect it to be rare since it implies that we lost the TCP connection to the server for some reason. Finding out why it happens is therefore worth doing. > I found that bind_connection_to_session sets the NFS4_SESSION_DRAINING > flag on the client, but this flag is never unset before > nfs4_check_lease() reaches nfs4_proc_sequence(). This causes the client > to deadlock, halting all NFS activity to the server. Otherwise, the text is more or less OK. The one thing that I'm missing is a statement about why nfs4_proc_sequence() needs to be a privileged operation. -- Trond Myklebust Linux NFS client maintainer NetApp Trond.Myklebust@xxxxxxxxxx www.netapp.com ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥