On Tue, 5 Oct 2010 15:18:29 -0400 david.kondrad@xxxxxxxxxx wrote: > > Greetings: > > We have a product that uses cifs mounts to catalog digital music on > cifs/smb shares on our user's network. We have noticed that while > cataloging, if a computer leaves the network while the share is > mounted, any processes that attempt access while be held in the 'D' > state (which is expected); however, there doesn't appear to be any > time-out to give up trying to reconnect to the missing server. > > Upon examining the source, I believe the issue is in this loop in > fs/cifs/connect.c:186 (inside cifs_reconnect() proc) > > while ((server->tcpStatus != CifsExiting) && > (server->tcpStatus != CifsGood)) { > try_to_freeze(); > if (server->addr.sockAddr6.sin6_family == AF_INET6) > rc = ipv6_connect(server); > else > rc = ipv4_connect(server); > if (rc) { > cFYI(1, "reconnect error %d", rc); > msleep(3000); > } else { > atomic_inc(&tcpSesReconnectCount); > spin_lock(&GlobalMid_Lock); > if (server->tcpStatus != CifsExiting) > server->tcpStatus = CifsGood; > server->sequence_number = 0; > spin_unlock(&GlobalMid_Lock); > /* atomic_set(&server->inFlight,0);*/ > wake_up(&server->response_q); > } > } > > It looks like if the server leaves the network, this loop will never > exit because the loop condition never becomes false. > > While I was looking at the callers of cifs_reconnect, I also noticed > that the return code is consistently ignored which points to the > assumption that cifs_reconnect() only returns when the connection is > again viable. > > At first glance, the most direct route would be to add a retry counter > to the loop and then exit out if not successful within (user > configurable?) attempts; however, I am not familiar enough with the > cifs code-base to anticipate any regressions that might result from > that. > > Any thoughts/comments/suggestions on supplying a fail-safe break out > of this loop? I'm guessing if we can't reconnect then we should treat > this connection as unmounted and fail out of any syscalls, right? > > Best regards, > David > Retrying indefinitely is actually the correct behavior I believe. If I'm writing or reading files to/from CIFS, then the last thing I want is for the kernel to corrupt that data or start returning errors just because the network is having problems. The correct behavior (IMO) is for the kernel to start sending SMB echo requests if it doesn't get a reply within a certain amount of time. If those don't get a reply after a certain number of attempts (probably user-configurable), then we cancel in-flight MID's and attempt to reconnect to the server. However, none of this should cause the client to start returning errors to userspace. That doesn't make for robustness in the face of network partitions. If, however, the processes waiting on syscalls are interrupted with a SIGKILL, we probably ought to return an error to userspace (probably an EINTR). What you may want to look at is wait_for_response -- it probably ought to be fixed to handle fatal signals and return errors when the task gets one. -- Jeff Layton <jlayton@xxxxxxxxxx> -- To unsubscribe from this list: send the line "unsubscribe linux-cifs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html