Re: deadlock in cifs mounts when server connection goes down

Jeff Layton <jlayton@xxxxxxxxxx> · Tue, 5 Oct 2010 18:21:11 -0400

On Tue, 5 Oct 2010 15:18:29 -0400
david.kondrad@xxxxxxxxxx wrote:

> 
> Greetings:
> 
> We have a product that uses cifs mounts to catalog digital music on
> cifs/smb shares on our user's network. We have noticed that while
> cataloging, if a computer leaves the network while the share is
> mounted, any processes that attempt access while be held in the 'D'
> state (which is expected); however, there doesn't appear to be any
> time-out to give up trying to reconnect to the missing server.
> 
> Upon examining the source, I believe the issue is in this loop in
> fs/cifs/connect.c:186 (inside cifs_reconnect() proc)
> 
>    while ((server->tcpStatus != CifsExiting) &&
>           (server->tcpStatus != CifsGood)) {
>       try_to_freeze();
>       if (server->addr.sockAddr6.sin6_family == AF_INET6)
>          rc = ipv6_connect(server);
>       else
>          rc = ipv4_connect(server);
>       if (rc) {
>          cFYI(1, "reconnect error %d", rc);
>          msleep(3000);
>       } else {
>          atomic_inc(&tcpSesReconnectCount);
>          spin_lock(&GlobalMid_Lock);
>          if (server->tcpStatus != CifsExiting)
>             server->tcpStatus = CifsGood;
>          server->sequence_number = 0;
>          spin_unlock(&GlobalMid_Lock);
>    /*    atomic_set(&server->inFlight,0);*/
>          wake_up(&server->response_q);
>       }
>    }
> 
> It looks like if the server leaves the network, this loop will never
> exit because the loop condition never becomes false.
> 
> While I was looking at the callers of cifs_reconnect, I also noticed
> that the return code is consistently ignored which points to the
> assumption that cifs_reconnect() only returns when the connection is
> again viable.
> 
> At first glance, the most direct route would be to add a retry counter
> to the loop and then exit out if not successful within (user
> configurable?) attempts; however, I am not familiar enough with the
> cifs code-base to anticipate any regressions that might result from
> that.
> 
> Any thoughts/comments/suggestions on supplying a fail-safe break out
> of this loop? I'm guessing if we can't reconnect then we should treat
> this connection as unmounted and fail out of any syscalls, right?
> 
> Best regards,
> David
> 

Retrying indefinitely is actually the correct behavior I believe. If
I'm writing or reading files to/from CIFS, then the last thing I want
is for the kernel to corrupt that data or start returning errors just
because the network is having problems.

The correct behavior (IMO) is for the kernel to start sending SMB echo
requests if it doesn't get a reply within a certain amount of time. If
those don't get a reply after a certain number of attempts (probably
user-configurable), then we cancel in-flight MID's and attempt to
reconnect to the server.

However, none of this should cause the client to start returning errors
to userspace. That doesn't make for robustness in the face of network
partitions. If, however, the processes waiting on syscalls are
interrupted with a SIGKILL, we probably ought to return an error to
userspace (probably an EINTR).

What you may want to look at is wait_for_response -- it probably ought
to be fixed to handle fatal signals and return errors when the task
gets one.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html