Re: [PATCH 1/3 v2] SUNRPC: remove printk when back channel request not found

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Mon, 18 Dec 2023 15:05:26 +0000

> On Dec 18, 2023, at 7:48 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> 
> On 16 Dec 2023, at 11:39, Chuck Lever III wrote:
> 
>>> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>> 
>>> On 15 Dec 2023, at 16:47, Dai Ngo wrote:
>>> 
>>>> If the client interface is down, or there is a network partition between
>>>> the client and server, that prevents the callback request to reach the
>>>> client TCP on the server will keep re-transmitting the callback for about
>>>> ~9 minutes before giving up and closes the connection.
>>>> 
>>>> If the connection between the client and the server is re-established
>>>> before the connection is closed and after the callback timed out (9 secs)
>>>> then the re-transmitted callback request will arrive at the client. When
>>>> the server receives the reply of the callback, receive_cb_reply prints the
>>>> "Got unrecognized reply..." message in the system log since the callback
>>>> request was already removed from the server xprt's recv_queue.
>>>> 
>>>> Even though this scenario has no effect on the server operation, a
>>>> malicious client can take advantage of this behavior and send thousand
>>>> of callback replies with random XIDs to fill up the server's system log.
>>> 
>>> I don't think this is a serious risk.  There's plenty of things a malicious
>>> client can do besides try to fill up a system log.
>> 
>> It's a general policy to remove printk's that can be triggered via
>> the network. On the client side (xprt_lookup_rqst) we have a dprintk
>> and a trace point. There's no unconditional log message there.
> 
> Ok, fair.

Fwiw, that policy is the particular reason I favor applying
this one.

>>> This particular printk has been an excellent indicator of transport or
>>> client issues over the years.
>> 
>> The problem is it can also be triggered by malicious behavior as well
>> as unrelated issues (like a network partition). It's got a pretty low
>> signal-to-noise ratio IMO; it's somewhat alarming and non-actionable
>> for server administrators.
> 
> I have never seen a case with a malicious NFS client, and I'm also sure if I
> were to run a malicious client I couldn't think of a better way of raising
> my hand to say so.

Right. Maybe the patch description should say "malfunctioning
or malicious client" or something more generic.

> I've seen a lot of misbehaving middle-boxes, or incorrectly setup split
> routing, or migrated-across-the-network clients..

OK, then, there might be some value to this warning outside
of code development. But the current warning message might
be better at directing administrators to look at network
issues rather than "hey I don't recognize that XID".

>>> Seeing it in the log on a customer systems
>>> shaves a lot of time off an initial triage of an issue.  Seeing it in my
>>> testing environment immediately notifies me of what might be an otherwise
>>> hard-to-notice problem.
>> 
>> Can you give some details?
> 
> I don't immediately have more details at hand beyond what I've already said
> - when a customer says they're seeing this message and NFS is slow or
> failing in some way, its a big short cut to finding the problem.

Well I'm mostly interested in understanding why and what kind
of problems you find this way... it could mean we could replace
this warning with something as good, or we could find and fix
a class of problems if you see a common element among issues
reported in this way.

I wasn't really going for "put up or shut up", more like "how
can we improve backchannel observability without blathering
all over the kernel log?"

>> It would be OK with me to replace the unconditional warning with a
>> trace point.
> 
> Of course, but that tracepoint would have to be enabled in order to see that
> a significant disruption in the client-server conversation was occurring.

A counter might work -- that's always on. I can't see that the
particular XIDs reported by this warning are useful.

If you have one or two bugs that are public I could look at,
I would be really interested in what you find with this failure
mode.

> That's all I have for this patch -- just giving some feedback.

Thanks, appreciated!

--
Chuck Lever