On Mon, 2017-06-19 at 16:36 -0400, Olga Kornievskaia wrote: > An application can fail with EIO when the server reboots and the > client while doing the recovery is getting into the following > situation where it incorrectly chooses "nograce" recovery instead > of "reboot" recovery. The necessary trigger is a pnfs mount and 2 DS > reads on the same file fail with BAD_STATEID each read uses its own > lock stateid (and MDS server reboots losing state). Since server > rebooted it will fail recovery ops with BAD_SESSION and then also > STALE_ID triggering session and clientid recovery. > > 1. Thread1 gets BAD_STATEID initiates stateid recovery calls > nfs4_state_mark_reclaim_no_grace() which sets the cl_state->flags and > state->flags respective _NOGRACE flags. > > 2. State manager gets to run and it clears the _NOGRACE from the > cl_state. Calls nfs4_do_reclaim() which sends TEST_STATEID which gets > a BAD_SESSION. > > 3. Thread2 now gets control and it also processing its BAD_STATEID > and > calls nfs4_state_mark_reclaim_no_grace() and it again sets the > state/cl_state->flags to have _NOGRACE. > > 4. State manager runs and continues with state recovery by sending > DESTROY_SESSION, and then CREATE_SESSION which fails with > STALE_CLIENTID. Now we are recovering the lease. > > 5. State manager in nfs4_recovery_handle_error for STALE_CLIENTID, > calls nfs4_state_start_reclaim_reboot() calls > nfs4_state_mark_reclaim_reboot() which has a check > > /* Don't recover state that expired before the reboot */ > if (test_bit(NFS_STATE_RECLAIM_NOGRACE, &state->flags)) { > clear_bit(NFS_STATE_RECLAIM_REBOOT, &state->flags); > return 0; > } > > Because the _NOGRACE was set again in Step3, we end up clearing > _RECLAIM_REBOOT and not setting _RECLAIM_REBOOT in the cl_state- > >flags > and choose to do "nograce" operation instead of recovery. > > Proposing to introduce a new state flag indicating recovery is > going on and when set in nfs4_state_mark_reclaim_nograce() return. > This looks to be a consequence of treating the error codes from the DS as being indistinguishable from the MDS ones. I think the right fix here is rather to treat the DS error codes as an indication that we need to drop the layout and retry the I/O (either resend a layoutget, or retry the I/O through the MDS). -- Trond Myklebust Linux NFS client maintainer, PrimaryData trond.myklebust@xxxxxxxxxxxxxxx ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥