On Wed, 2017-02-15 at 16:56 -0500, Olga Kornievskaia wrote: > On Wed, Feb 15, 2017 at 3:48 PM, Trond Myklebust > <trondmy@xxxxxxxxxxxxxxx> wrote: > > On Wed, 2017-02-15 at 15:16 -0500, Olga Kornievskaia wrote: > > > On Sun, Dec 4, 2016 at 7:40 PM, Trond Myklebust > > > <trond.myklebust@xxxxxxxxxxxxxxx> wrote: > > > > In the case where SEQUENCE receives a NFS4ERR_BADSESSION or > > > > NFS4ERR_DEADSESSION error, we just want to report the session > > > > as > > > > needing > > > > recovery, and then we want to retry the operation. > > > > > > > > Signed-off-by: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx > > > > > > > > > --- > > > > fs/nfs/nfs4proc.c | 4 ++++ > > > > 1 file changed, 4 insertions(+) > > > > > > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > > > > index f992281c9b29..4fd0e2b7b08e 100644 > > > > --- a/fs/nfs/nfs4proc.c > > > > +++ b/fs/nfs/nfs4proc.c > > > > @@ -816,6 +816,10 @@ static int nfs41_sequence_process(struct > > > > rpc_task *task, > > > > case -NFS4ERR_SEQ_FALSE_RETRY: > > > > ++slot->seq_nr; > > > > goto retry_nowait; > > > > + case -NFS4ERR_DEADSESSION: > > > > + case -NFS4ERR_BADSESSION: > > > > + nfs4_schedule_session_recovery(session, res- > > > > > sr_status); > > > > > > > > + goto retry_nowait; > > > > default: > > > > /* Just update the slot sequence no. */ > > > > slot->seq_done = 1; > > > > -- > > > > 2.9.3 > > > > > > Hi Trond, > > > > > > Can you explain the implications of retrying the operation > > > without > > > waiting for recovery to complete? > > > > > > This patch introduces regression in intra COPY failing if the > > > server > > > rebooted during that operation. What happens is that COPY is re- > > > sent > > > with the same stateid from the old open instead of the open that > > > was > > > done from the recovery (leading to BAD_STATEID error and > > > failure). > > > > > > I wonder if it's because COPY is written to just use > > > nfs4_call_sync() > > > instead of defining rpc_callback_ops to handle rpc_prepare() > > > where a > > > new stateid could be gotten? I have re-written the COPY to do > > > that > > > and > > > I no longer see that problem. > > > > > > If my suspicion is correct, it would help for the future to know > > > that > > > any operations that use stateid must have rpc callback ops > > > defined/used so that they avoid this problem. Perhaps as a > > > comment in > > > the code or maybe some other way? > > > > > > Thanks. > > > > > > > Yes, the way that rpc_call_sync() has been written, it assumes that > > the > > call is unaffected by reboot recovery, or that the resulting state > > errors will be handled by the caller. The patch you reference does > > not > > change that expectation. > > The "or" is confusing me. The current COPY code does call into > nfs4_handle_exception() and "should handle errors". Yet after this > patch, the code fails to in presence of a reboot. I don't see what > else can the current code should do in terms of handling errors to > fix > that problem. > > I'm almost ready to submit the patch that turns the existing code > into > async rpc but if this is not the right approach then I'd like to know > where I should be looking into instead. > > Thanks. > > > > > The same race can happen, for instance, if your call to > > rpc_call_sync() > > coincides with a reboot recovery that was started by another > > process. > > I don't understand. If it is handling the resulting NFS4ERR_BAD_STATEID error correctly, then what is failing? As I said above, you can get the NFS4ERR_BAD_STATEID from other instances of the same race. -- Trond Myklebust Linux NFS client maintainer, PrimaryData trond.myklebust@xxxxxxxxxxxxxxx ��.n��������+%������w��{.n�����{��w���jg��������ݢj����G�������j:+v���w�m������w�������h�����٥