Re: How to automatically drop unresponsive CIFS /SMB connections

Lucy Kueny <lucy@xxxxxxxx> · Mon, 5 Feb 2024 00:26:42 +0100

On 03/02/2024 23:48, R. Diez wrote:
> Hi all:
> 
> I have been mounting Windows shares for years with this script, which
> just boils down to "sudo mount -t cifs":
> 
> https://github.com/rdiez/Tools/blob/master/MountWindowsShares/mount-windows-shares-sudo.sh
> 
> I noticed under Linux that some applications (like Emacs), the desktop's
> file manager (like Caja) or even the whole desktop sometimes hang for a
> number of seconds. It is very annoying. It turns out the reason is that
> the hanging software is trying to look at a file or a directory on an
> unresponsive CIFS / SMB mount.
> 
> The easiest way to reproduce this issue is from outside the office: I
> start the VPN, connect to the Windows shares, and then tear down the VPN.
> 
> I have tried mount option "echo_interval=4", but that does not really
> help. The Kernel does seem to notice more quickly that the connection
> has become unresponsive:
> 
> Feb 03 23:24:37 rdiez4 kernel: CIFS: VFS: \\192.168.1.3 has not
> responded in 12 seconds. Reconnecting...
> 
> The trouble is, it tries to reconnect automatically. That means that the
> next application which attempts to access something under the
> unresponsive mount will hang again. I think the pauses last 10 seconds,
> it must be hard-coded in the CIFS Kernel code. If the application
> retries itself, or tries to look at more than 1 file before failing the
> whole operation, then the time adds up accordingly. If the shell's
> current directory is on such a failing path, it bugs you for a while.
> 
> What I need is for the connection to automatically drop when it becomes
> unresponsive, and do not retry to connect again.
> 
> Alternatively, applications should fail immediately if a connection has
> been deemed unresponsive in the meantime, and hasn't been successfully
> re-established yet.
> 
> Is there a way to achieve that behaviour?
> 
> Thanks in advance,
>   rdiez
> 

Hi everyone,

I have written a patch that does this. It adds a mount option that makes file access fail immediately with EHOSTDOWN once N consecutive waits for a reconnect have timed out.
It's written against Linux 6.7 but still applies on cifs-2.6. I asked the same question on this mailing list a while ago.

Add "max_blocking_recon=1" to your mount arguments; the default of 0 keeps the current behaviour of always waiting. I run it on my machines.
It probably needs polishing from somebody more experienced than me.
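
To illustrate the intended behaviour, here is a minimal userspace sketch of the counter logic the patch adds to cifs_wait_for_server_reconnect(). It is only a model, not kernel code: struct server_model, model_wait_for_reconnect() and the "waits" field are made up for illustration, and the blocking wait is replaced by a flag saying whether the server came back in time. It assumes a soft mount (no endless retry) and only covers calls made while the connection is in the need-reconnect state.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; the first two field names mirror the patch. */
struct server_model {
	unsigned long max_blocking_reconnect; /* 0 = always wait (default) */
	unsigned long reconnect_fail_cnt;     /* consecutive timed-out waits */
	unsigned long waits;                  /* illustration only: times a caller blocked */
};

/*
 * Model of one call made while the server needs a reconnect.
 * "reconnected" says whether the simulated server came back during the wait.
 * Returns 0 on success and -EHOSTDOWN otherwise.
 */
int model_wait_for_reconnect(struct server_model *m, bool reconnected)
{
	assert(m != NULL);

	/* Enough consecutive failures already: fail without blocking. */
	if (m->max_blocking_reconnect &&
	    m->reconnect_fail_cnt >= m->max_blocking_reconnect)
		return -EHOSTDOWN;

	m->waits++; /* stands in for the blocking wait in the kernel */

	if (reconnected) {
		m->reconnect_fail_cnt = 0; /* success resets the counter */
		return 0;
	}

	m->reconnect_fail_cnt++; /* this wait timed out */
	return -EHOSTDOWN;
}
```

With max_blocking_reconnect set to 1, the first failing call blocks once and every later call returns -EHOSTDOWN straight away; with the default of 0, every call blocks as before.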


Best regards,
Lucy Kueny



From 98e2e44d39f4f5172e3ce416a2e65a48b51e2de1 Mon Sep 17 00:00:00 2001
From: Lucy Kueny <lucy@xxxxxxxx>
Date: Fri, 22 Sep 2023 11:06:20 +0200
Subject: [PATCH] Stop reconnect timeouts from freezing userspace

---
 fs/smb/client/cifsfs.c     |  3 +++
 fs/smb/client/cifsglob.h   |  6 ++++++
 fs/smb/client/connect.c    |  2 ++
 fs/smb/client/fs_context.c |  6 ++++++
 fs/smb/client/fs_context.h |  2 ++
 fs/smb/client/misc.c       | 13 +++++++++++++
 6 files changed, 32 insertions(+)

diff --git a/fs/smb/client/cifsfs.c b/fs/smb/client/cifsfs.c
index 22869cda1356..ea338b335074 100644
--- a/fs/smb/client/cifsfs.c
+++ b/fs/smb/client/cifsfs.c
@@ -694,6 +694,9 @@ cifs_show_options(struct seq_file *s, struct dentry *root)
 		seq_puts(s, ",noblocksend");
 	if (tcon->ses->server->nosharesock)
 		seq_puts(s, ",nosharesock");
+	if (tcon->ses->server->max_blocking_reconnect != DEFAULT_MAX_BLOCKING_RECONNECT)
+		seq_printf(s, ",max_blocking_recon=%lu",
+			   tcon->ses->server->max_blocking_reconnect);
 
 	if (tcon->snapshot_time)
 		seq_printf(s, ",snapshot=%llu", tcon->snapshot_time);
diff --git a/fs/smb/client/cifsglob.h b/fs/smb/client/cifsglob.h
index 032d8716f671..5128123148e1 100644
--- a/fs/smb/client/cifsglob.h
+++ b/fs/smb/client/cifsglob.h
@@ -84,6 +84,10 @@
 /* maximum number of PDUs in one compound */
 #define MAX_COMPOUND 5
 
+/* maximum failed reconnects before file access fails without waiting */
+#define DEFAULT_MAX_BLOCKING_RECONNECT 0
+
+
 /*
  * Default number of credits to keep available for SMB3.
  * This value is chosen somewhat arbitrarily. The Windows client
@@ -731,6 +735,8 @@ struct TCP_Server_Info {
 	struct delayed_work reconnect; /* reconnect workqueue job */
 	struct mutex reconnect_mutex; /* prevent simultaneous reconnects */
 	unsigned long echo_interval;
+	unsigned long max_blocking_reconnect; /* maximum failed reconnects before file access fails without waiting */
+	unsigned long reconnect_fail_cnt; /* consecutive reconnect timeouts on file access */
 
 	/*
 	 * Number of targets available for reconnect. The more targets
diff --git a/fs/smb/client/connect.c b/fs/smb/client/connect.c
index 687754791bf0..999f87633baa 100644
--- a/fs/smb/client/connect.c
+++ b/fs/smb/client/connect.c
@@ -1740,6 +1740,8 @@ cifs_get_tcp_session(struct smb3_fs_context *ctx,
 			goto out_err_crypto_release;
 		}
 	}
+	tcp_ses->max_blocking_reconnect = ctx->max_blocking_reconnect;
+	tcp_ses->reconnect_fail_cnt = 0;
 	rc = ip_connect(tcp_ses);
 	if (rc < 0) {
 		cifs_dbg(VFS, "Error connecting to socket. Aborting operation.\n");
diff --git a/fs/smb/client/fs_context.c b/fs/smb/client/fs_context.c
index e45ce31bbda7..0ae441c97bff 100644
--- a/fs/smb/client/fs_context.c
+++ b/fs/smb/client/fs_context.c
@@ -154,6 +154,7 @@ const struct fs_parameter_spec smb3_fs_parameters[] = {
 	fsparam_u32("handletimeout", Opt_handletimeout),
 	fsparam_u64("snapshot", Opt_snapshot),
 	fsparam_u32("max_channels", Opt_max_channels),
+	fsparam_u32("max_blocking_recon", Opt_max_blocking_reconnect),
 
 	/* Mount options which take string value */
 	fsparam_string("source", Opt_source),
@@ -1166,6 +1167,9 @@ static int smb3_fs_context_parse_param(struct fs_context *fc,
 		if (result.uint_32 > 1)
 			ctx->multichannel = true;
 		break;
+	case Opt_max_blocking_reconnect:
+		ctx->max_blocking_reconnect = result.uint_32;
+		break;
 	case Opt_max_cached_dirs:
 		if (result.uint_32 < 1) {
 			cifs_errorf(fc, "%s: Invalid max_cached_dirs, needs to be 1 or more\n",
@@ -1615,6 +1619,8 @@ int smb3_init_fs_context(struct fs_context *fc)
 	ctx->multichannel = false;
 	ctx->max_channels = 1;
 
+	ctx->max_blocking_reconnect = DEFAULT_MAX_BLOCKING_RECONNECT;
+
 	ctx->backupuid_specified = false; /* no backup intent for a user */
 	ctx->backupgid_specified = false; /* no backup intent for a group */
 
diff --git a/fs/smb/client/fs_context.h b/fs/smb/client/fs_context.h
index 9d8d34af0211..478b3a9d3af5 100644
--- a/fs/smb/client/fs_context.h
+++ b/fs/smb/client/fs_context.h
@@ -131,6 +131,7 @@ enum cifs_param {
 	Opt_max_cached_dirs,
 	Opt_snapshot,
 	Opt_max_channels,
+	Opt_max_blocking_reconnect,
 	Opt_handletimeout,
 
 	/* Mount options which take string value */
@@ -262,6 +263,7 @@ struct smb3_fs_context {
 	__u32 handle_timeout; /* persistent and durable handle timeout in ms */
 	unsigned int max_credits; /* smb3 max_credits 10 < credits < 60000 */
 	unsigned int max_channels;
+	unsigned int max_blocking_reconnect;
 	unsigned int max_cached_dirs;
 	__u16 compression; /* compression algorithm 0xFFFF default 0=disabled */
 	bool rootfs:1; /* if it's a SMB root file system */
diff --git a/fs/smb/client/misc.c b/fs/smb/client/misc.c
index 366b755ca913..51320ec6b08a 100644
--- a/fs/smb/client/misc.c
+++ b/fs/smb/client/misc.c
@@ -1318,6 +1318,13 @@ int cifs_wait_for_server_reconnect(struct TCP_Server_Info *server, bool retry)
 		return 0;
 	}
 	timeout *= server->nr_targets;
+	/* return immediately on repeated timeouts */
+	if (server->max_blocking_reconnect &&
+		server->reconnect_fail_cnt >= server->max_blocking_reconnect) {
+		spin_unlock(&server->srv_lock);
+		cifs_dbg(FYI, "%s: not waiting for reconnect as requested\n", __func__);
+		return -EHOSTDOWN;
+	}
 	spin_unlock(&server->srv_lock);
 
 	/*
@@ -1341,12 +1348,18 @@ int cifs_wait_for_server_reconnect(struct TCP_Server_Info *server, bool retry)
 		/* are we still trying to reconnect? */
 		spin_lock(&server->srv_lock);
 		if (server->tcpStatus != CifsNeedReconnect) {
+			server->reconnect_fail_cnt = 0;
 			spin_unlock(&server->srv_lock);
 			return 0;
 		}
 		spin_unlock(&server->srv_lock);
 	} while (retry);
 
+	/* increase failed attempt counter */
+	spin_lock(&server->srv_lock);
+	server->reconnect_fail_cnt += 1;
+	spin_unlock(&server->srv_lock);
+
 	cifs_dbg(FYI, "%s: gave up waiting on reconnect\n", __func__);
 	return -EHOSTDOWN;
 }
-- 
2.42.0