Re: [PATCH] smb: client: fix hang in wait_for_response() for negproto

Enzo Matsumiya <ematsumiya@xxxxxxx> · Fri, 6 Sep 2024 17:37:02 -0300

On 08/31, Paulo Alcantara wrote:
Call cifs_reconnect() to wake up processes waiting on negotiate
protocol to handle the case where server abruptly shut down and had no
chance to properly close the socket.

Simple reproducer:

 ssh 192.168.2.100 pkill -STOP smbd
 mount.cifs //192.168.2.100/test /mnt -o ... [never returns]

Cc: Rickard Andersson <rickaran@xxxxxxxx>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@xxxxxxxxxxxxx>
---
fs/smb/client/connect.c | 14 +++++++++++++-
1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index c1c14274930a..e004b515e321 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -656,6 +656,19 @@ allocate_buffers(struct TCP_Server_Info *server)
static bool
server_unresponsive(struct TCP_Server_Info *server)
{
+	/*
+	 * If we're in the process of mounting a share or reconnecting a session
+	 * and the server abruptly shut down (e.g. socket wasn't closed properly),
+	 * wait for at least an echo interval (+7s from rcvtimeo) when attempting
+	 * to negotiate protocol.
+	 */
+	spin_lock(&server->srv_lock);
+	if (server->tcpStatus == CifsInNegotiate &&
+	    time_after(jiffies, server->lstrp + server->echo_interval)) {
+		spin_unlock(&server->srv_lock);
+		cifs_reconnect(server, false);
+		return true;
+	}
	/*
	 * We need to wait 3 echo intervals to make sure we handle such
	 * situations right:
@@ -667,7 +680,6 @@ server_unresponsive(struct TCP_Server_Info *server)
	 * 65s kernel_recvmsg times out, and we see that we haven't gotten
	 *     a response in >60s.
	 */
-	spin_lock(&server->srv_lock);
	if ((server->tcpStatus == CifsGood ||
	    server->tcpStatus == CifsNeedNegotiate) &&
	    (!server->ops->can_echo || server->ops->can_echo(server)) &&
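
For anyone following along, the new CifsInNegotiate check boils down to a
wraparound-safe timestamp comparison. A rough userspace model of it, not the
kernel code, might look like the following. struct server_model, tick_after()
and negotiate_timed_out() are made-up names for illustration, and plain
"ticks" stand in for jiffies:

```c
#include <stdbool.h>

/* Userspace stand-ins; illustrative names, not the kernel definitions. */
enum tcp_status { STATUS_GOOD, STATUS_NEED_NEGOTIATE, STATUS_IN_NEGOTIATE };

struct server_model {
	enum tcp_status status;
	unsigned long lstrp;         /* tick at which the last response arrived */
	unsigned long echo_interval; /* echo interval, in ticks */
};

/* Wraparound-safe "a is after b", same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * True when a negotiate is in flight and more than one echo interval has
 * passed since the last response, i.e. the condition under which the patch
 * calls cifs_reconnect().
 */
static bool negotiate_timed_out(const struct server_model *srv,
				unsigned long now)
{
	return srv->status == STATUS_IN_NEGOTIATE &&
	       tick_after(now, srv->lstrp + srv->echo_interval);
}
```

With lstrp = 100 and echo_interval = 60, the model only reports a timeout
once now is past 160, and it never does so outside the in-negotiate state.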

As an extra precaution, it may also be worth adding a check in
smb2_reconnect() that could catch other cases:

--- a/fs/smb/client/smb2pdu.c
+++ b/fs/smb/client/smb2pdu.c
@@ -278,6 +278,9 @@ smb2_reconnect(__le16 smb2_command, struct cifs_tcon *tcon,
                        spin_unlock(&server->srv_lock);
                        return -EAGAIN;
                }
+       } else if (server->tcpStatus == CifsInNegotiate) {
+               spin_unlock(&server->srv_lock);
+               return -EAGAIN;
        }

        /* if server is marked for termination, cifsd will cleanup */


FTR, we hit this exact same bug a few months ago and fixed it
downstream with the above check, combined with using
wait_event_timeout() in wait_for_response() (also using @echo_interval
seconds as the timeout).
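
To illustrate the bounded-wait idea (this is not the actual downstream
patch), here is a userspace analogue that uses pthread_cond_timedwait() in
place of wait_event_timeout(). struct request_model and
wait_for_response_bounded() are hypothetical names, and returning -EAGAIN on
timeout is an assumption made for the sketch:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a pending request and its wait queue. */
struct request_model {
	pthread_mutex_t lock;
	pthread_cond_t response_q;
	bool done; /* set by whoever delivers the response */
};

/*
 * Wait for the response, but give up after timeout_ms instead of sleeping
 * forever. Returns 0 if the response arrived, -EAGAIN otherwise.
 */
static int wait_for_response_bounded(struct request_model *req, long timeout_ms)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&req->lock);
	/* Loop to cope with spurious wakeups; stop on timeout or error. */
	while (!req->done && rc == 0)
		rc = pthread_cond_timedwait(&req->response_q, &req->lock,
					    &deadline);
	done = req->done;
	pthread_mutex_unlock(&req->lock);

	return done ? 0 : -EAGAIN;
}
```

The point is only that the waiter gets control back after a bounded time
even if nobody ever wakes it up, so the caller can retry or bail out.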

We never sent it upstream because it looked too much like a hack and the
bug was only reproducible on our v5.14-based kernel.

Anyway, HTH.


Cheers,

Enzo