Re: [PATCH 1/1] NFS fix race in state recovery

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 20 Jun 2017 19:46:09 +0000



On Mon, 2017-06-19 at 16:36 -0400, Olga Kornievskaia wrote:
> An application can fail with EIO when the server reboots and the
> client while doing the recovery is getting into the following
> situation where it incorrectly chooses "nograce" recovery instead
> of "reboot" recovery. The necessary trigger is a pnfs mount and 2 DS
> reads on the same file fail with BAD_STATEID each read uses its own
> lock stateid (and MDS server reboots losing state). Since  server
> rebooted it will fail recovery ops with BAD_SESSION and then also
> STALE_ID triggering session and clientid recovery.
> 
> 1. Thread1 gets BAD_STATEID initiates stateid recovery calls
> nfs4_state_mark_reclaim_no_grace() which sets the cl_state->flags and
> state->flags respective _NOGRACE flags.
> 
> 2. State manager gets to run and it clears the _NOGRACE from the
> cl_state. Calls nfs4_do_reclaim() which sends TEST_STATEID which gets
> a BAD_SESSION.
> 
> 3. Thread2 now gets control and it also processing its BAD_STATEID
> and
> calls nfs4_state_mark_reclaim_no_grace() and it again sets the
> state/cl_state->flags to have _NOGRACE.
> 
> 4. State manager runs and continues with state recovery by sending
> DESTROY_SESSION, and then CREATE_SESSION which fails with
> STALE_CLIENTID. Now we are recovering the lease.
> 
> 5. State manager in nfs4_recovery_handle_error for STALE_CLIENTID,
> calls nfs4_state_start_reclaim_reboot() calls
> nfs4_state_mark_reclaim_reboot() which has a check
> 
> /* Don't recover state that expired before the reboot */
> if (test_bit(NFS_STATE_RECLAIM_NOGRACE, &state->flags)) {
>   clear_bit(NFS_STATE_RECLAIM_REBOOT, &state->flags);
>   return 0;
> }
> 
> Because the _NOGRACE was set again in Step3, we end up clearing
> _RECLAIM_REBOOT and not setting _RECLAIM_REBOOT in the cl_state-
> >flags
> and choose to do "nograce" operation instead of recovery.
> 
> Proposing to introduce a new state flag indicating recovery is
> going on and when set in nfs4_state_mark_reclaim_nograce() return.
> 
This looks to be a consequence of treating the error codes from the DS
as being indistinguishable from the MDS ones. I think the right fix
here is rather to treat the DS error codes as an indication that we
need to drop the layout and retry the I/O (either resend a layoutget,
or retry the I/O through the MDS).

-- 
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.myklebust@xxxxxxxxxxxxxxx
��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥