On 18 Apr 2024, at 10:11, Olga Kornievskaia wrote: > From: Olga Kornievskaia <kolga@xxxxxxxxxx> > > In the case where we have received a successful reply to an RPC request, > but while processing the reply the client in rpc_decode_header() finds > an expired context, the code ends up propagating the error to the caller > instead of getting a new context and retrying the request. > > To give more details, in rpc_decode_header() we call rpcauth_checkverf() > will call into the gss and internally will at some point call > gss_validate() which has a check if the current’s context lifetime > expired, and it would fail. The reason for the failure gets ‘scrubbed’ > and translated to EACCES so when we get back to rpc_decode_header() we > just go to “out_verifier” which for that error would get converted to > “out_garbage” (ie it’s treated as garballed reply) and the next > action is call_encode. Which (1) doesn’t reencode or re-send (not to > mention no upcall happens because context expires as that reason just > not known) and it again fails in the same decoding process. After > re-trying it 3 times the error is propagated back to the caller > (ie nfs4_write_done_cb() in the case a failing write). > > To fix this, instead we need to look to the case where the server > decides that context has expired and replies with an RPC auth error. > In that case, the rpc_decode_header() goes to "out_msg_denied" in that > we return EKEYREJECTED which in call_decode() is sent to “call_reserve” > which triggers an upcalls and a re-try of the operation. > > The proposed fix is in case of a failed rpc_decode_header() to check > if credentials were set to be invalid and use that as a proxy for > deciding that context has expired and then treat is same way as > receiving an auth error. > > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx> > > --- This looks like a day-0 problem and should probably be back > ported to earlier kernels. Looks good to me. Reviewed-by: Benjamin Coddington <bcodding@xxxxxxxxxx> Ben