Re: [PATCH] Revert "SUNRPC: Fail faster on bad verifier"

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 6 Sep 2023 18:43:05 +0000

> On Sep 6, 2023, at 2:27 PM, Trond Myklebust <trondmy@xxxxxxxxxx> wrote:
> 
> On Wed, 2023-09-06 at 18:05 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Sep 6, 2023, at 12:18 PM, Trond Myklebust <trondmy@xxxxxxxxxx>
>>> wrote:
>>> 
>>> On Wed, 2023-09-06 at 15:20 +0000, Chuck Lever III wrote:
>>>> 
>>>> 
>>>>> On Sep 6, 2023, at 10:33 AM, Trond Myklebust
>>>>> <trondmy@xxxxxxxxxx>
>>>>> wrote:
>>>>> 
>>>>> On Wed, 2023-09-06 at 13:40 +0000, Chuck Lever III wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Sep 5, 2023, at 9:03 PM, trondmy@xxxxxxxxxx wrote:
>>>>>>> 
>>>>>>> From: Trond Myklebust <trond.myklebust@xxxxxxxxxxxxxxx>
>>>>>>> 
>>>>>>> This reverts commit
>>>>>>> 0701214cd6e66585a999b132eb72ae0489beb724.
>>>>>>> 
>>>>>>> The premise of this commit was incorrect. There are exactly
>>>>>>> 2
>>>>>>> cases
>>>>>>> where rpcauth_checkverf() will return an error:
>>>>>>> 
>>>>>>> 1) If there was an XDR decode problem (i.e. garbage data).
>>>>>>> 2) If gss_validate() had a problem verifying the RPCSEC_GSS
>>>>>>> MIC.
>>>>>> 
>>>>>> There's also the AUTH_TLS probe:
>>>>>> 
>>>>>> https://www.rfc-editor.org/rfc/rfc9289.html#section-4.1-7
>>>>>> 
>>>>>> That was the purpose of 0701214cd6e6.
>>>>>> 
>>>>>> Reverting this commit is likely to cause problems when our
>>>>>> TLS-capable client interacts with a server that knows
>>>>>> nothing of AUTH_TLS.
>>>>> 
>>>>> The patch completely broke the semantics of the header
>>>>> validation
>>>>> code.
>>>> 
>>>> If that were truly the case, it's amazing that the client
>>>> has hobbled along for the past 14 months with no-one
>>>> noticing the breakage until now.
>>>> 
>>>> Seriously, though, treating a bad verifier as garbage args
>>>> is not intuitive. If it's that critical there needs to be
>>>> a comment in the code explaining why.
>>>> 
>>> 
>>> It is necessary because of the peculiarities of RPCSEC_GSS and the
>>> session semantics it implements.
>>> See
>>> https://datatracker.ietf.org/doc/html/rfc2203#section-5.3.3.1 and
>>> in particular, the paragraph discussing retransmissions by the
>>> client.
>> 
>> Retrying is fine.
>> 
>> But the counter in the client is called "garbage_retries".
>> That's not what is going on the EACCES case, though the
>> behavior is close enough -- it's code re-use (good) without
>> appropriate documentation (bad).
>> 
>> The decoder treats EIO and EACCES exactly the same way.
>> Again, code reuse (good) without appropriate documentation
>> (bad).
>> 
>> I tried to address that in my RFC patch by adding a small
>> explanatory comment and by adding an API contract for
>> rpcauth_checkverf().
>> 
>> 
>>>>> There is no discussion about whether or not it needs to be
>>>>> reverted.
>>>> 
>>>> The patch description is wrong, though, to exclude AUTH_TLS.
>>>> 
>>>> The reverting patch description claims to be an exhaustive
>>>> list of all the cases, but it doesn't mention the AUTH_TLS
>>>> case at all.
>>>> 
>>>> 
>>>>> If the TLS code needs special treatment, then a separate patch
>>>>> is
>>>>> needed to fix tls_validate() to return something that can be
>>>>> caught
>>>>> by
>>>>> rpc_decode_header and interpreted differently to the EIO and
>>>>> EACCES
>>>>> error codes currently being returned by RPCSEC_GSS, AUTH_SYS
>>>>> and
>>>>> others.
>>>> 
>>>> That could have been brought up when 0701214cd6e6 was first
>>>> posted for review. Interesting that the decoder currently
>>>> does not distinguish between EIO and EACCES.
>>>> 
>>>> Thanks for the suggestion, I'll have a look.
>>>> 
>>> 
>>> Now that I look at it, I think your approach to satisfying RFC9289
>>> is
>>> not correct.
>> 
>> I'm not following what aspect of the implementation is problematic.
>> I'm going to assume you mean the implementation of opportunism.
>> 
>> 
>>> Since this is a transport level issue, why should we not just mark
>>> the
>>> xprt for disconnection, and then retry? It is entirely possible
>>> that
>>> some load balancer/floating IP has just moved the connection to
>>> some
>>> node that was not expecting to do TLS.
>> 
>> Depending on the security policy chosen by the client's
>> administrator,
>> that could either be a security problem or a "don't care" situation.
>> 
>> If the administrator wants the client to _require_ TLS, then
>> connecting to a load balancer where TLS suddenly becomes unavailable
>> after a reconnect is a hard error. This prevents STRIPTLS attacks.
>> That's good security.
>> 
>> If the administrator wants to allow operation to continue even if TLS
>> is not available, then the client can recover by not using TLS.
>> That's
>> rather terrible security, but can be desirable to improve backward
>> compatibility.
>> 
>> 
>>> The only case where that should
>>> not be assumed is the case where the error happens right at the
>>> very
>>> beginning of the mount, when disconnecting should normally suffice
>>> to
>>> trigger the RPC_TASK_SOFTCONN code anyway.
>> 
>> If TLS goes away after a reconnect, that's a problem. Whether
>> further operation should stop depends on the administrator's
>> chosen security policy.
>> 
>> The security policies are NFS-level settings (eg, mount options).
>> RPC just indicates to NFS whether the traffic is protected or not.
>> 
>> When NFS asks RPC to ensure the communication channel is protected,
>> that means every reconnect is protected. Communication with that
>> security policy cannot happen without protection.
>> 
>> Trust me, the security community will have it no other way.
>> 
>> If you need opportunism in this case, then I can add back the
>> "xprtsec=auto" mount option, which you asked me to remove a while
>> back.
> 
> I don't see how described behaviour would cause the operation to
> proceed without TLS. I'm saying disconnect and then retry TLS
> negotiation, and then eventually fail.

As I said above: 

>> I'm not following what aspect of the implementation is problematic.

I still don't understand what you want changed. (Not arguing,
just not understanding).

One or the other side will disconnect if the probe fails. My
memory might be failing, but I'm not aware of the client
continuing to use a connection after sending an RPC_AUTH_TLS
probe that fails. Also I'm not aware that RFC 9289 requires
disconnection after a probe failure, though I believe that
the Linux implementation does disconnect.

By "continuing operation" I mean simply that either the client
tries to reconnect with TLS (in the _requires_ case) or the
client reconnects with no probe and no TLS (in the _opportunistic_
case). A TLS failure, after the initial contact with the server,
will not cause a non-probe RPC to terminate.

If we don't disconnect after a probe failure, I can look into
changing that. But I think we already work that way.

--
Chuck Lever