Re: AutoFS+NFSv4 server down = LOOOOONG timeout.

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 27 Aug 2009 10:54:27 -0400

On Aug 27, 2009, at 10:52 AM, Trond Myklebust wrote:
On Thu, 2009-08-27 at 10:38 -0400, Chuck Lever wrote:
On Aug 27, 2009, at 4:54 AM, Ian Kent wrote:
Ian Kent wrote:
Carlos André wrote:
Hi Ian,

Thanks for the patch, and sorry for the delay (I was expecting your
reply on the bug tracker, not here) :)

But this patch didn't work for me as expected...  :(

Firstly I've changed "#MOUNT_WAIT=-1" to "MOUNT_WAIT=10"
and later changed "10" to "2" with same results...
(always restarting service, of course :)

Then I tried removing "sec=krb5p", and later removed "nfs4", but I got
the same results again.

Or am I doing something wrong?

[root@KSTATION areas]# automount -V

Linux automount version 5.0.1-0.rc2.131.bz517349.1
[...]

[root@KSTATION areas]# time ls -la testdown
ls: testedown: No such file or directory

real    3m9.006s
user    0m0.002s
sys     0m0.000s

OK, that isn't behaving the way I expect, I'll have a look.

LOGGING:
-----------------------------------------
Aug 24 09:23:51 KSTATION automount[20803]: mount_mount: mount(nfs): calling mount -t nfs4 -s -o rw,acl,sec=krb5p 1.2.3.4:/areas/testdown /misc/areas/testdown
Aug 24 09:27:00 KSTATION automount[20803]: mount(nfs): nfs: mount failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
Aug 24 09:27:00 KSTATION automount[20803]: ioctl_send_fail: token = 91
Aug 24 09:27:00 KSTATION automount[20803]: failed to mount /misc/areas/testdown
-----------------------------------------

Having a look at this, I suspect the reason it doesn't work as expected
is that the waitpid(2) we do after sending the TERM signal to the mount
process (which we have to do) is not returning. This is likely because
the mount process isn't giving up as quickly as it used to.
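
To illustrate the failure mode described above, here is a minimal sketch
(not the autofs code) of sending TERM and then waiting with a deadline
instead of an unbounded waitpid(2). The names terminate_with_deadline and
spawn_sleeper are hypothetical, and the "mount" is simulated by a child
that simply sleeps, optionally with SIGTERM ignored. It assumes a POSIX
system.

```python
import signal
import subprocess
import sys


def terminate_with_deadline(proc, grace=1.0):
    """Send SIGTERM, wait at most `grace` seconds, then fall back to SIGKILL.

    Returns the child's return code; a negative value is the number of the
    signal that ended it.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        # Bounded wait: unlike a plain waitpid(2), this gives up after `grace`.
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        # The child did not react to TERM in time; escalate.
        proc.kill()
        return proc.wait()


def spawn_sleeper(ignore_term):
    """Start a stand-in for a slow mount: a child that sleeps for 30 seconds.

    With ignore_term=True the child inherits SIGTERM set to "ignore",
    simulating a process that does not give up when asked to terminate.
    """
    preexec = None
    if ignore_term:
        # Ignored signal dispositions survive exec, so the child keeps this.
        preexec = lambda: signal.signal(signal.SIGTERM, signal.SIG_IGN)
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        preexec_fn=preexec,
    )


if __name__ == "__main__":
    stubborn = spawn_sleeper(ignore_term=True)
    print("stubborn child ended with", terminate_with_deadline(stubborn, grace=0.5))
```

Note that the SIGKILL fallback only helps if the process can be killed at
all: a task stuck in an uninterruptible sleep will not respond to it
either, in which case the final wait here would still block.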

You're thinking maybe mount(2) should be as interruptible as the
socket calls that the mount command used to do?  That might be
reasonable, and I can take a look at that.

In recent kernels, all those RPC calls should be using TASK_KILLABLE
sleep states. SIGTERM should cause them to abort, provided that some
process isn't blocking it.
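
As a small userspace illustration, assuming "blocking" here refers to the
signal mask: a SIGTERM sent to a process that has it blocked is not
delivered, it just stays pending. The sketch below runs a throwaway child
script (CHILD_SCRIPT and blocked_term_demo are hypothetical names) that
blocks SIGTERM, signals itself, and reports whether the signal is
pending. It assumes a POSIX system and says nothing about the in-kernel
TASK_KILLABLE behaviour itself.

```python
import subprocess
import sys

# Child script: block SIGTERM, send it to ourselves, then report whether it
# is sitting in the pending set instead of having been delivered.
CHILD_SCRIPT = """
import os, signal
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
os.kill(os.getpid(), signal.SIGTERM)
print(signal.SIGTERM in signal.sigpending())
"""


def blocked_term_demo():
    """Run the child and return (its stdout, its return code)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SCRIPT],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip(), result.returncode


if __name__ == "__main__":
    print(blocked_term_demo())
```

The child exits normally because the blocked signal is never delivered
to it.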

Perhaps TASK_KILLABLE could be backported to RHEL-5?

That's pretty extensive, with hooks in the page cache.  I doubt RH
would go for that.

In the kernel, if the rpcbind for the MNT request is async, that would
be done by rpciod.  That's a different process, so the signal wouldn't
have any effect on the mount.  I have a patch that converts the MNT
client to use rpcb_getport_sync() which might help in this case.

The client shouldn't be using rpcbind at all when doing a NFSv4 mount.

Yep, forgot this was NFSv4.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
