Re: [PATCH 2/2] mount: RPC_PROGNOTREGISTERED should not be a permanent error

NeilBrown <neilb@xxxxxxxx> · Thu, 24 Nov 2016 10:26:16 +1100

On Thu, Nov 24 2016, Steve Dickson wrote:

> On 11/22/2016 05:43 PM, NeilBrown wrote:
>> On Wed, Nov 23 2016, Steve Dickson wrote:
>> 
>>> [Resent due to mailman rejecting the HTML subpart]
>> (and the resend included HTML too ... how embarrassing :-)
> Yeah... :-) I guess an upgrade turned it on.. 
>
>> 
>>>
>>> Hey Neil,
>>>
>>>
>>> On 08/18/2016 09:45 PM, NeilBrown wrote:
>>>> Commit: bf66c9facb8e ("mounts.nfs: v2 and v3 background mounts should retry when server is down.")
>>>>
>>>> changed the behaviour of "bg" mounts so that RPC_PROGNOTREGISTERED,
>>>> which maps to EOPNOTSUPP, is not a permanent error.
>>>> This useful because when an NFS server starts up there is a small window between
>>>> the moment that rpcbind (or portmap) starts responding to lookup requests,
>>>> and the moment when nfsd registers with rpcbind.  During that window
>>>> rpcbind will reply with RPC_PROGNOTREGISTERED, but mount should not give up.
>>>>
>>>> This same reasoning applies to foreground mounts.  They don't wait for
>>>> as long, but could still hit the window and fail prematurely.
>>>>
>>>> So revert the above patch and instead add EOPNOTSUPP to the list of
>>>> temporary errors known to nfs_is_permanent_error.
>>>>
>>>> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
>>>> ---
>>>>  utils/mount/stropts.c |    7 +++----
>>>>  1 file changed, 3 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/utils/mount/stropts.c b/utils/mount/stropts.c
>>>> index 9de6794c6177..d5dfb5e4a669 100644
>>>> --- a/utils/mount/stropts.c
>>>> +++ b/utils/mount/stropts.c
>>>> @@ -948,6 +948,7 @@ static int nfs_is_permanent_error(int error)
>>>>  	case ETIMEDOUT:
>>>>  	case ECONNREFUSED:
>>>>  	case EHOSTUNREACH:
>>>> +	case EOPNOTSUPP:	/* aka RPC_PROGNOTREGISTERED */
>>> I think this introduced a regression... When the server does not support
>>> a protocol, say UDP, this patch cause the mount to hang forever,
>>> which I don't think we want.
>> 
>> 
>> I think we do want it to wait a while so that the nfs server has a
>> chance to start up.  We have no guarantee that the NFS server will be
>> registered with rpcbind before rpcbind responds to requests.
> I do see this race but there it has to be a small window. With
> Fedora its under seconds between the time rpcbind started
> and the NFS server.
>
>> 
>> I disagree with the "hang forever" description.  I just tested after
>> disabling UDP on an nfs server, and the delay was 2 minutes, 5 seconds
>> before a failure was reported.  It might be longer when trying TCP on a
>> server that only supports UDP.
> Yeah I did not wait that long... You are much more of a patient man than I ;-) 
> I do think this is a regression. Going an from an instant failure to one
> that takes over 2min is not a good thing... IMHO.
>
>> 
>> So I think the current behavior is correct.  You might be able to argue
>> that certain error codes should trigger a shorter timeout, but it would
>> need a strong argument.
> Going with the theory the window is very small, how about 
> a retry with a timeout then a failure? 

I started looking at changing the timeout and it wouldn't be too hard
(if we can agree on a suitable delay), but I feel I must ask why this is
important.
In what situation are you likely to mount with the wrong protocol, that
you aren't able to just Ctrl-C when you realized what a dumb thing you
just did?

If rpcbind isn't running, which is arguably a very similar situation
(no protocols are register) we have always had a long timeout. Why is
"just one protocol not registered" any different?


Anyway, below is the patch I was working on.  I stopped when I wasn't
sure how to handle ECONNREFUSED.

NeilBrown

diff --git a/utils/mount/stropts.c b/utils/mount/stropts.c
index d5dfb5e4a669..084776115b9f 100644
--- a/utils/mount/stropts.c
+++ b/utils/mount/stropts.c
@@ -935,24 +935,30 @@ static int nfs_try_mount(struct nfsmount_info *mi)
  * failed so far, but fail immediately if there is a local
  * error (like a bad mount option).
  *
- * ESTALE is also a temporary error because some servers
- * return ESTALE when a share is temporarily offline.
+ * If there is a remote error, like ESTALE or RPC_PROGNOTREGISTERED
+ * then it is probably permanent, but there is a small chance
+ * the it is temporary can we caught the server at an awkward
+ * time during start-up.  A shorter timeout is best for such
+ * circumstances, so return a distinct status.
  *
- * Returns 1 if we should fail immediately, or 0 if we
- * should retry.
+ * Returns PERMANENT if we should fail immediately,
+ * TEMPORARY if we should retry normally, or
+ * REMOTE if we should retry with shorter timeout.
  */
-static int nfs_is_permanent_error(int error)
+enum error_type { PERMANENT, TEMPORARY, REMOTE };
+static enum error_type nfs_error_type(int error)
 {
 	switch (error) {
 	case ESTALE:
+	case EOPNOTSUPP:	/* aka RPC_PROGNOTREGISTERED */
+		return REMOTE;
 	case ETIMEDOUT:
 	case ECONNREFUSED:
 	case EHOSTUNREACH:
-	case EOPNOTSUPP:	/* aka RPC_PROGNOTREGISTERED */
 	case EAGAIN:
-		return 0;	/* temporary */
+		return TEMPORARY;
 	default:
-		return 1;	/* permanent */
+		return PERMANENT;
 	}
 }
 
@@ -967,6 +973,7 @@ static int nfsmount_fg(struct nfsmount_info *mi)
 {
 	unsigned int secs = 1;
 	time_t timeout;
+	int last_errno = 0;
 
 	timeout = nfs_parse_retry_option(mi->options,
 					 NFS_DEF_FG_TIMEOUT_MINUTES);
@@ -987,13 +994,22 @@ static int nfsmount_fg(struct nfsmount_info *mi)
 			 */
 			return EX_SUCCESS;
 
-		if (nfs_is_permanent_error(errno))
+		switch(nfs_error_type(errno)) {
+		case PERMANENT:
+			timeout = 0;
 			break;
-
-		if (time(NULL) > timeout) {
+		case REMOTE:
+			if (errno == last_errno)
+				timeout = 0;
+			break;
+		case TEMPORARY:
 			errno = ETIMEDOUT;
 			break;
 		}
+		last_errno = errno;
+
+		if (time(NULL) > timeout)
+			break;
 
 		if (errno != ETIMEDOUT) {
 			if (sleep(secs))
@@ -1020,7 +1036,7 @@ static int nfsmount_parent(struct nfsmount_info *mi)
 	if (nfs_try_mount(mi))
 		return EX_SUCCESS;
 
-	if (nfs_is_permanent_error(errno)) {
+	if (nfs_error_type(errno) == PERMANENT) {
 		mount_error(mi->spec, mi->node, errno);
 		return EX_FAIL;
 	}
@@ -1055,8 +1071,14 @@ static int nfsmount_child(struct nfsmount_info *mi)
 		if (nfs_try_mount(mi))
 			return EX_SUCCESS;
 
-		if (nfs_is_permanent_error(errno))
+		switch (nfs_error_type(errno)) {
+		case REMOTE: /* Doesn't hurt to keep trying remote errors
+			      * when in the background
+			      */
+		case PERMANENT:
+			timeout = 0;
 			break;
+		}
 
 		if (time(NULL) > timeout)
 			break;
Attachment:
signature.asc

Description: PGP signature