Re: [PATCH] Revert "SUNRPC: Fail faster on bad verifier"

Trond Myklebust <trondmy@xxxxxxxxxx> · Wed, 06 Sep 2023 14:27:58 -0400

On Wed, 2023-09-06 at 18:05 +0000, Chuck Lever III wrote:
> 
> 
> > On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@xxxxxxxxxx>
> > wrote:
> > 
> > On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote:
> > > 
> > > 
> > > > On Sep 6, 2023, at 10:33 AM, Trond Myklebust
> > > > <trondmy@xxxxxxxxxx>
> > > > wrote:
> > > > 
> > > > On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > 
> > > > > > On Sep 5, 2023, at 9:03 PM, trondmy@xxxxxxxxxx wrote:
> > > > > > 
> > > > > > From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > 
> > > > > > This reverts commit
> > > > > > 0701214cd6e66585a999b132eb72ae0489beb724.
> > > > > > 
> > > > > > The premise of this commit was incorrect. There are exactly
> > > > > > 2
> > > > > > cases
> > > > > > where rpcauth_checkverf() will return an error:
> > > > > > 
> > > > > > 1) If there was an XDR decode problem (i.e. garbage data).
> > > > > > 2) If gss_validate() had a problem verifying the RPCSEC_GSS
> > > > > > MIC.
> > > > > 
> > > > > There's also the AUTH_TLS probe:
> > > > > 
> > > > > https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7
> > > > > 
> > > > > That was the purpose of 0701214cd6e6.
> > > > > 
> > > > > Reverting this commit is likely to cause problems when our
> > > > > TLS-capable client interacts with a server that knows
> > > > > nothing of AUTH_TLS.
> > > > 
> > > > The patch completely broke the semantics of the header
> > > > validation
> > > > code.
> > > 
> > > If that were truly the case, it's amazing that the client
> > > has hobbled along for the past 14 months with no-one
> > > noticing the breakage until now.
> > > 
> > > Seriously, though, treating a bad verifier as garbage args
> > > is not intuitive. If it's that critical there needs to be
> > > a comment in the code explaining why.
> > > 
> > 
> > It is necessary because of the peculiarities of RPCSEC_GSS and the
> > session semantics it implements.
> > See
> > https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and
> > in particular, the paragraph discussing retransmissions by the
> > client.
> 
> Retrying is fine.
> 
> But the counter in the client is called "garbage_retries".
> That's not what is going on the EACCES case, though the
> behavior is close enough -- it's code re-use (good) without
> appropriate documentation (bad).
> 
> The decoder treats EIO and EACCES exactly the same way.
> Again, code reuse (good) without appropriate documentation
> (bad).
> 
> I tried to address that in my RFC patch by adding a small
> explanatory comment and by adding an API contract for
> rpcauth_checkverf().
> 
> 
> > > > There is no discussion about whether or not it needs to be
> > > > reverted.
> > > 
> > > The patch description is wrong, though, to exclude AUTH_TLS.
> > > 
> > > The reverting patch description claims to be an exhaustive
> > > list of all the cases, but it doesn't mention the AUTH_TLS
> > > case at all.
> > > 
> > > 
> > > > If the TLS code needs special treatment, then a separate patch
> > > > is
> > > > needed to fix tls_validate() to return something that can be
> > > > caught
> > > > by
> > > > rpc_decode_header and interpreted differently to the EIO and
> > > > EACCES
> > > > error codes currently being returned by RPCSEC_GSS, AUTH_SYS
> > > > and
> > > > others.
> > > 
> > > That could have been brought up when 0701214cd6e6 was first
> > > posted for review. Interesting that the decoder currently
> > > does not distinguish between EIO and EACCES.
> > > 
> > > Thanks for the suggestion, I'll have a look.
> > > 
> > 
> > Now that I look at it, I think your approach to satisfying RFC9289
> > is
> > not correct.
> 
> I'm not following what aspect of the implementation is problematic.
> I'm going to assume you mean the implementation of opportunism.
> 
> 
> > Since this is a transport level issue, why should we not just mark
> > the
> > xprt for disconnection, and then retry? It is entirely possible
> > that
> > some load balancer/floating IP has just moved the connection to
> > some
> > node that was not expecting to do TLS.
> 
> Depending on the security policy chosen by the client's
> administrator,
> that could either be a security problem or a "don't care" situation.
> 
> If the administrator wants the client to _require_ TLS, then
> connecting to a load balancer where TLS suddenly becomes unavailable
> after a reconnect is a hard error. This prevents STRIPTLS attacks.
> That's good security.
> 
> If the administrator wants to allow operation to continue even if TLS
> is not available, then the client can recover by not using TLS.
> That's
> rather terrible security, but can be desirable to improve backward
> compatibility.
> 
> 
> > The only case where that should
> > not be assumed is the case where the error happens right at the
> > very
> > beginning of the mount, when disconnecting should normally suffice
> > to
> > trigger the RPC_TASK_SOFTCONN code anyway.
> 
> If TLS goes away after a reconnect, that's a problem. Whether
> further operation should stop depends on the administrator's
> chosen security policy.
> 
> The security policies are NFS-level settings (eg, mount options).
> RPC just indicates to NFS whether the traffic is protected or not.
> 
> When NFS asks RPC to ensure the communication channel is protected,
> that means every reconnect is protected. Communication with that
> security policy cannot happen without protection.
> 
> Trust me, the security community will have it no other way.
> 
> If you need opportunism in this case, then I can add back the
> "xprtsec=auto" mount option, which you asked me to remove a while
> back.

I don't see how described behaviour would cause the operation to
proceed without TLS. I'm saying disconnect and then retry TLS
negotiation, and then eventually fail.

> 
> 
> > > > > > In the second case, there are again 2 subcases:
> > > > > > 
> > > > > > a) The GSS context expires, in which case gss_validate()
> > > > > > will
> > > > > > force
> > > > > > a
> > > > > >   new context negotiation on retry by invalidating the
> > > > > > cred.
> > > > > > b) The sequence number check failed because an RPC call
> > > > > > timed
> > > > > > out,
> > > > > > and
> > > > > >   the client retransmitted the request using a new sequence
> > > > > > number,
> > > > > >   as required by RFC2203.
> > > > > > 
> > > > > > In neither subcase is this a fatal error.
> > > > > > 
> > > > > > Reported-by: Russell Cattelan <cattelan@xxxxxxxxxxx>
> > > > > > Fixes: 0701214cd6e6 ("SUNRPC: Fail faster on bad verifier")
> > > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > > Signed-off-by: Trond Myklebust
> > > > > > <trond.myklebust@xxxxxxxxxxxxxxx>
> > > > > > ---
> > > > > > net/sunrpc/clnt.c | 2 +-
> > > > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> > > > > > index 12c46e129db8..5a7de7e55548 100644
> > > > > > --- a/net/sunrpc/clnt.c
> > > > > > +++ b/net/sunrpc/clnt.c
> > > > > > @@ -2724,7 +2724,7 @@ rpc_decode_header(struct rpc_task
> > > > > > *task,
> > > > > > struct xdr_stream *xdr)
> > > > > > 
> > > > > > out_verifier:
> > > > > > trace_rpc_bad_verifier(task);
> > > > > > - goto out_err;
> > > > > > + goto out_garbage;
> > > > > > 
> > > > > > out_msg_denied:
> > > > > > error = -EACCES;
> > > > > > -- 
> > > > > > 2.41.0
> > > > > > 
> > > > > 
> > > > > --
> > > > > Chuck Lever
> > > > > 
> > > > > 
> > > > 
> > > > -- 
> > > > Trond Myklebust
> > > > Linux NFS client maintainer, Hammerspace
> > > > trond.myklebust@xxxxxxxxxxxxxxx
> > > 
> > > 
> > > --
> > > Chuck Lever
> 
> 
> --
> Chuck Lever
> 
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx