On 16 Dec 2023, at 11:39, Chuck Lever III wrote: >> On Dec 16, 2023, at 6:12 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >> >> On 15 Dec 2023, at 16:47, Dai Ngo wrote: >> >>> If the client interface is down, or there is a network partition between >>> the client and server, that prevents the callback request to reach the >>> client TCP on the server will keep re-transmitting the callback for about >>> ~9 minutes before giving up and closes the connection. >>> >>> If the connection between the client and the server is re-established >>> before the connection is closed and after the callback timed out (9 secs) >>> then the re-transmitted callback request will arrive at the client. When >>> the server receives the reply of the callback, receive_cb_reply prints the >>> "Got unrecognized reply..." message in the system log since the callback >>> request was already removed from the server xprt's recv_queue. >>> >>> Even though this scenario has no effect on the server operation, a >>> malicious client can take advantage of this behavior and send thousand >>> of callback replies with random XIDs to fill up the server's system log. >> >> I don't think this is a serious risk. There's plenty of things a malicious >> client can do besides try to fill up a system log. > > It's a general policy to remove printk's that can be triggered via > the network. On the client side (xprt_lookup_rqst) we have a dprintk > and a trace point. There's no unconditional log message there. Ok, fair. >> This particular printk has been an excellent indicator of transport or >> client issues over the years. > > The problem is it can also be triggered by malicious behavior as well > as unrelated issues (like a network partition). It's got a pretty low > signal-to-noise ratio IMO; it's somewhat alarming and non-actionable > for server administrators. I have never seen a case with a malicious NFS client, and I'm also sure if I were to run a malicious client I couldn't think of a better way of raising my hand to say so. I've seen a lot of misbehaving middle-boxes, or incorrectly setup split routing, or migrated-across-the-network clients.. >> Seeing it in the log on a customer systems >> shaves a lot of time off an initial triage of an issue. Seeing it in my >> testing environment immediately notifies me of what might be an otherwise >> hard-to-notice problem. > > Can you give some details? I don't immediately have more details at hand beyond what I've already said - when a customer says they're seeing this message and NFS is slow or failing in some way, its a big short cut to finding the problem. > It would be OK with me to replace the unconditional warning with a > trace point. Of course, but that tracepoint would have to be enabled in order to see that a significant disruption in the client-server conversation was occuring. That's all I have for this patch -- just giving some feedback. Ben