Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

Benjamin Coddington <bcodding@xxxxxxxxxx> · Mon, 18 Dec 2023 07:48:36 -0500

On 16 Dec 2023, at 11:39, Chuck Lever III wrote:

>> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>
>> On 15 Dec 2023, at 16:47, Dai Ngo wrote:
>>
>>> If the client interface is down, or there is a network partition between
>>> the client and server, that prevents the callback request to reach the
>>> client TCP on the server will keep re-transmitting the callback for about
>>> ~9 minutes before giving up and closes the connection.
>>>
>>> If the connection between the client and the server is re-established
>>> before the connection is closed and after the callback timed out (9 secs)
>>> then the re-transmitted callback request will arrive at the client. When
>>> the server receives the reply of the callback, receive_cb_reply prints the
>>> "Got unrecognized reply..." message in the system log since the callback
>>> request was already removed from the server xprt's recv_queue.
>>>
>>> Even though this scenario has no effect on the server operation, a
>>> malicious client can take advantage of this behavior and send thousand
>>> of callback replies with random XIDs to fill up the server's system log.
>>
>> I don't think this is a serious risk.  There's plenty of things a malicious
>> client can do besides try to fill up a system log.
>
> It's a general policy to remove printk's that can be triggered via
> the network. On the client side (xprt_lookup_rqst) we have a dprintk
> and a trace point. There's no unconditional log message there.

Ok, fair.

>> This particular printk has been an excellent indicator of transport or
>> client issues over the years.
>
> The problem is it can also be triggered by malicious behavior as well
> as unrelated issues (like a network partition). It's got a pretty low
> signal-to-noise ratio IMO; it's somewhat alarming and non-actionable
> for server administrators.

I have never seen a case with a malicious NFS client, and I'm also sure if I
were to run a malicious client I couldn't think of a better way of raising
my hand to say so.

I've seen a lot of misbehaving middle-boxes, or incorrectly setup split
routing, or migrated-across-the-network clients..

>> Seeing it in the log on a customer systems
>> shaves a lot of time off an initial triage of an issue.  Seeing it in my
>> testing environment immediately notifies me of what might be an otherwise
>> hard-to-notice problem.
>
> Can you give some details?

I don't immediately have more details at hand beyond what I've already said
- when a customer says they're seeing this message and NFS is slow or
failing in some way, its a big short cut to finding the problem.

> It would be OK with me to replace the unconditional warning with a
> trace point.

Of course, but that tracepoint would have to be enabled in order to see that
a significant disruption in the client-server conversation was occuring.

That's all I have for this patch -- just giving some feedback.

Ben