Re: AutoFS+NFSv4 server down = LOOOOONG timeout.

Carlos André <candrecn@xxxxxxxxx> · Thu, 13 Aug 2009 12:18:55 -0300

Filed bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=517349

Thanks!

2009/8/13 Carlos André <candrecn@xxxxxxxxx>:
> 2009/8/13 Ian Kent <ikent@xxxxxxxxxx>:
>> Carlos André wrote:
>>> Today (2009-08-12) I'm using:
>>> kernel-2.6.18-128.2.1.el5
>>> autofs-5.0.1-0.rc2.102.el5_3.1
>>
>> Thanks,
>>
>> My mistake, the wait time I was referring to is used for umounts during
>> expires and is present in rev rc2.102.
>>
>> It shouldn't be hard to add this for mount as well.
>> Would you like me to put something together?
>
> Sure! That'll help me a lot (and other people too, for sure) :) Thanks :)
>
>>
>> Probably would be good to test something out to see if we can make a
>> difference by killing mount after some configured timeout but, if
>> we make progress, probably the best way to deal with it is for you to
>> log a bug against rhel-5 so I can get it committed to the rhel package.
>> The possible issue is that I'm not sure if the RPC subsystem in the
>> above rhel kernel will respond well to process death with potential
>> outstanding requests. But we'll see.
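
The "kill mount after some configured timeout" idea can be illustrated with a minimal sketch in Python. This is not autofs code (autofs is written in C) and not the patch being discussed; `run_with_timeout` is a hypothetical helper, and a sleeping child process stands in for a hung mount:

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_secs):
    """Run argv; return its exit status, or None if it had to be killed
    because it was still running after timeout_secs."""
    try:
        return subprocess.run(argv, timeout=timeout_secs).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the child here.
        return None


if __name__ == "__main__":
    # Stand-in for a mount that never returns: a child sleeping 30 seconds.
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, 1))
```

Whether a real mount process blocked inside the kernel exits promptly when killed is exactly the open question raised above, and this sketch does not answer it.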
>
> Ok, on my way :)
>
> Thanks a lot!
>
>>
>>>
>>>
>>> Look at my last test:
>>> --------------------------------------------------------------
>>> [root@KSTATION areas]# time ls testdown
>>> ls: testdown: No such file or directory
>>>
>>> real    3m9.025s
>>> user    0m0.000s
>>> sys     0m0.002s
>>>
>>>
>>>
>>>
>>> Aug 12 12:57:07 KSTATION automount[15471]: sun_mount: parse(sun):
>>> mounting root /misc/areas, mountpoint testdown, what
>>> 1.2.3.4:/areas/testdown, fstype nfs4, options
>>> acl,sec=krb5p,proto=tcp,retry=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: do_mount:
>>> 1.2.3.4:/areas/testdown /misc/areas/testdown type nfs4 options
>>> acl,sec=krb5p,proto=tcp,retry=0 using module nfs4
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> root=/misc/areas name=testdown what=1.2.3.4:/areas/testdown,
>>> fstype=nfs4, options=acl,sec=krb5p,proto=tcp,retry=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> nfs options="acl,sec=krb5p,proto=tcp,retry=0", nosymlink=0, ro=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> calling mkdir_path /misc/areas/testdown
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> calling mount -t nfs4 -s -o acl,sec=krb5p,proto=tcp,retry=0
>>> 1.2.3.4:/areas/testdown /misc/areas/testdown
>>> Aug 12 12:58:12 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> Aug 12 12:58:12 KSTATION automount[15471]: expire_proc: exp_proc =
>>> 3078093712 path /misc
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_proc_indirect: 2
>>> submounts remaining in /misc
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: got thid
>>> 3078093712 path /misc stat 3
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: sigchld:
>>> exp 3078093712 finished, switching from 2 to 1
>>> Aug 12 12:58:13 KSTATION automount[15471]: st_ready: st_ready(): state
>>> = 2 path /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_proc: exp_proc =
>>> 3078093712 path /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_proc_indirect: 2
>>> submounts remaining in /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: got thid
>>> 3078093712 path /misc stat 3
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: sigchld:
>>> exp 3078093712 finished, switching from 2 to 1
>>> Aug 12 12:59:28 KSTATION automount[15471]: st_ready: st_ready(): state
>>> = 2 path /misc
>>> Aug 12 13:00:16 KSTATION automount[15471]: >> mount: mount to NFS
>>> server '1.2.3.4' failed: timed out (giving up).
>>> Aug 12 13:00:16 KSTATION automount[15471]: mount(nfs): nfs: mount
>>> failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
>>> Aug 12 13:00:16 KSTATION automount[15471]: send_fail: token = 17
>>> Aug 12 13:00:16 KSTATION automount[15471]: failed to mount /misc/areas/testdown
>>> Aug 12 13:00:43 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> --------------------------------------------------------------
>>>
>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>> Carlos André wrote:
>>>>> Hi Ian,
>>>>> I'm going crazy trying to get "retry=" to work on mount... this option
>>>>> just DOESN'T WORK if you use proto=tcp and/OR kerberos
>>>>> (sec=krb5/krb5i/krb5p), as you can see in my previous emails...
>>>> Right, my mistake for not looking closely enough at your post.
>>>>
>>>> Maybe this is related to the same sort of problem we had with mount in
>>>> the past, before the options parsing went into the kernel, where requests
>>>> to other services, like portmapper (or rpcbind), were being made with
>>>> different timeout parameters before the RPC calls for mounting. That's
>>>> just an example, as NFSv4 shouldn't be sensitive to portmapper anyway.
>>>>
>>>> But what version of autofs and kernel did you say you were using?
>>>>
>>>>> I appreciate any help.
>>>>>
>>>>> Carlos.
>>>>>
>>>>>
>>>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>>>> Chuck Lever wrote:
>>>>>>> On Aug 11, 2009, at 8:41 AM, Carlos André wrote:
>>>>>>>> This long timeout is good if a workstation needs to mount a critical
>>>>>>>> directory from /etc/fstab at boot (for example)...
>>>>>>>> But in my case this loooong timeout doesn't make any sense, since
>>>>>>>> autofs retries the mount on access. In fact it gives me a lot of
>>>>>>>> headaches, because user login just hangs if one server goes down for
>>>>>>>> any reason, and hangs again if the user tries to access a directory
>>>>>>>> pointing to a down NFS server...
>>>>>>> "retry=0" means the mount command will fail as soon as the first
>>>>>>> mount(2) system call fails.  When you set SYN retries to 1, this means
>>>>>>> after 9 seconds, the connect fails, and that causes the mount(2) system
>>>>>>> call to fail.
>>>>>>>
>>>>>>> Recent conversations with Ian suggested that a long timeout was desired
>>>>>>> for automounter as well as other cases.  Ian, is there something else we
>>>>>>> need to consider to determine the correct retry timeout for NFS/TCP
>>>>>>> mount points handled via automounter?  How should mount.nfs wait so we
>>>>>>> don't make other use cases worse?  (Looks like most of the history is
>>>>>>> intact below).
>>>>>> Of course we know that autofs is entirely at the mercy of mount(8) (and
>>>>>> mount.nfs in particular). This has always been a difficult situation for
>>>>>> the automounter because interactive mount invocations should wait. I
>>>>>> believe automount mounts should always time out quickly, although that
>>>>>> leads to its own set of problems, especially when home directories are
>>>>>> concerned.
>>>>>>
>>>>>> I think adding "retry=0" is the right thing to do myself but I'm not
>>>>>> certain that will work as we expect. I'll have to do some experimentation.
>>>>>>
>>>>>>> How long do you think is appropriate for the automounter to wait if the
>>>>>>> server is down, in your case, Carlos?
>>>>>>>
>>>>>>>> Am I missing something, or is there something weird going on...!?
>>>>>>>> ------------------------------------------------
>>>>>>>> [root@KSTATION ~]# echo 5 > /proc/sys/net/ipv4/tcp_syn_retries  [DEFAULT]
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    3m9.000s
>>>>>>>> user    0m0.002s
>>>>>>>> sys     0m0.001s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> sec=krb5p,proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    3m9.000s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    3m9.001s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m0.003s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> sec=krb5p,proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    3m9.001s
>>>>>>>> user    0m0.002s
>>>>>>>> sys     0m0.001s
>>>>>>>>
>>>>>>>> [root@KSTATION ~]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries [ 5 to 1 ]
>>>>>>>>
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 6]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    1m3.002s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> sec=krb5p,proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    2m6.000s
>>>>>>>> user    0m0.000s
>>>>>>>> sys     0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    0m9.003s
>>>>>>>> user    0m0.001s
>>>>>>>> sys     0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>> sec=krb5p,proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real    2m6.001s
>>>>>>>> user    0m0.001s
>>>>>>>> sys     0m0.002s
>>>>>>>> [root@KSTATION ~]#
>>>>>>>> ------------------------------------------------
>>>>>>>> max timeout goes to 2m6s changing tcp_syn_retries from 5 to 1... and
>>>>>>>> using retry=0 without kerberos I got only 9s...
>>>>>>>>
>>>>>>>> *sigh*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>> On Aug 10, 2009, at 4:05 PM, Carlos André wrote:
>>>>>>>>>> Something funny: Using default tcp_syn_retries (5) i got
>>>>>>>>>> "3,6,12,24,48,96" secs interval... but if i change tcp_syn_retries to
>>>>>>>>>> 1 i got "3,6,3,6,3,6..." secs interval...
>>>>>>>>> Right.  Normally the RPC client calls the kernel's socket connect
>>>>>>>>> function, which does 6 SYN retries.  That one call usually takes
>>>>>>>>> longer than the RPC client's connect timeout, so it only makes one
>>>>>>>>> connect call, and then fails.
>>>>>>>>>
>>>>>>>>> Reducing the number of SYN retries per connect attempt causes the RPC
>>>>>>>>> client to retry the connect call until its connect timeout expires.
>>>>>>>>> Each connect call resets the SYN timeout to 3 seconds.
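
The timings reported in this thread are consistent with simple backoff arithmetic. A minimal sketch in Python, assuming an initial SYN timeout of 3 seconds that doubles after each retransmission (the 3, 6, 12, 24, 48, 96 second intervals seen with tcpdump); the function names are made up for illustration and are not part of any kernel or nfs-utils interface:

```python
def connect_timeout(syn_retries, initial_rto=3):
    """Seconds until one connect() gives up: the initial SYN plus
    syn_retries retransmissions, each wait double the previous one."""
    return sum(initial_rto * 2 ** i for i in range(syn_retries + 1))


def mount_duration(syn_retries, connect_attempts, initial_rto=3):
    """Total seconds when the RPC client makes connect_attempts connect
    calls in a row, each starting again from the initial timeout."""
    return connect_attempts * connect_timeout(syn_retries, initial_rto)


print(connect_timeout(5))      # tcp_syn_retries=5: 3+6+12+24+48+96
print(connect_timeout(1))      # tcp_syn_retries=1: 3+6
print(mount_duration(1, 7))    # 6 "retrying" messages + 1 "giving up"
print(mount_duration(1, 14))   # 13 "retrying" messages + 1 "giving up"
```

These evaluate to 189, 9, 63 and 126 seconds, matching the 3m9s, 0m9s, 1m3s and 2m6s measurements. Treating each "timed out" message as one 9-second connect attempt is an inference from those numbers, not something confirmed from the source code.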
>>>>>>>>>
>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o
>>>>>>>>>> sec=krb5p,proto=tcp
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>
>>>>>>>>>> real    3m9.000s
>>>>>>>>>> user    0m0.000s
>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>
>>>>>>>>>> [root@KSERVER /]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o
>>>>>>>>>> sec=krb5p,proto=tcp  ("retry=1" = no change)
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>
>>>>>>>>>> real    2m6.004s
>>>>>>>>>> user    0m0.000s
>>>>>>>>>> sys     0m0.004s
>>>>>>>>>>
>>>>>>>>>> (3,6,3,6... secs interval)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2009/8/10 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>> No, I'm just using packages from the CentOS repo...
>>>>>>>>>>>
>>>>>>>>>>> And you're right about the exponential retries... monitoring the
>>>>>>>>>>> traffic with tcpdump, I got SYN retries with 3, 6, 12, 24, 48, 96
>>>>>>>>>>> sec intervals on port 2049...
>>>>>>>>>>> I tried the "retry=1" option on mount without any change... I don't
>>>>>>>>>>> want to change source or TCP timers... just the NFSv4 client.
>>>>>>>>>>>
>>>>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>>>>> On Aug 10, 2009, at 2:29 PM, Carlos André wrote:
>>>>>>>>>>>>> Bruce, no... you're right.  I'm describing a situation where my
>>>>>>>>>>>>> server died... I need the mount to fail faster (10 or 15 secs max)
>>>>>>>>>>>>> than 3 minutes and 9 seconds...
>>>>>>>>>>>> The 189 second timeout is likely how long it takes the kernel to
>>>>>>>>>>>> give up trying to connect a TCP socket to the server (6 SYN attempts
>>>>>>>>>>>> with exponential retries, or something like that).  For stock CentOS
>>>>>>>>>>>> 5.3, I think user space does only a DNS lookup for normal NFSv4
>>>>>>>>>>>> mounts -- the kernel just tries to connect a TCP socket to port 2049,
>>>>>>>>>>>> with no preceding rpcbind request.
>>>>>>>>>>>>
>>>>>>>>>>>> Carlos, let us know if you have replaced any NFS-related CentOS
>>>>>>>>>>>> components (kernel, nfs-utils) with something you've built yourself.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2009/8/7 J. Bruce Fields <bfields@xxxxxxxxxxxx>:
>>>>>>>>>>>>>> On Fri, Aug 07, 2009 at 09:42:18AM +0300, Benny Halevy wrote:
>>>>>>>>>>>>>>> On Aug. 07, 2009, 3:18 +0300, Carlos André <candrecn@xxxxxxxxx>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>> Anyone ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2009/7/29 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>>>>>>>> People, I need to get a CentOS 5.3 (updated) NFSv4 server working
>>>>>>>>>>>>>>>>> with Kerberos and AutoFS, but I have a problem: if the NFS server
>>>>>>>>>>>>>>>>> goes down I get a LOOOOOOONG mount timeout on the CentOS 5.3
>>>>>>>>>>>>>>>>> (updated) NFSv4 client...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since I need to mount some (3 to 6) dirs during user logon, if a
>>>>>>>>>>>>>>>>> mount hangs, the user logon hangs. So I want to configure it to
>>>>>>>>>>>>>>>>> time out (if the server is down) after 10-15 secs (MAX) on each
>>>>>>>>>>>>>>>>> mount attempt.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I already set up a lab and tried a LOT of combinations. Here are my
>>>>>>>>>>>>>>>>> findings (server DOWN IP: 172.16.0.10 / client IP: 172.16.1.10)
>>>>>>>>>>>>>>>>> using the basic command
>>>>>>>>>>>>>>>>> (time mount 172.16.0.10:/remotedir /localdir/ -t nfs4 -o
>>>>>>>>>>>>>>>>> sec=krb5,proto=<tcp/udp>) from the NFS client:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Once I try to access the mount point using AutoFS (proto=tcp OR
>>>>>>>>>>>>>>>>> proto=udp) it hangs for 189 secs (3m9s: real  3m9.001s) until it
>>>>>>>>>>>>>>>>> shows the error (mount: mount to NFS server '172.16.0.10' failed:
>>>>>>>>>>>>>>>>> timed out (giving up))
>>>>>>>>>>>>>>> Sounds like you're hitting the server's grace period.
>>>>>>>>>>>>>> I thought he was describing a situation where the server is
>>>>>>>>>>>>>> completely gone and isn't coming back, and wondering how to make
>>>>>>>>>>>>>> the mount fail faster.  But I may be misunderstanding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --b.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>>> linux-nfs" in
>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> --
>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>> chuck[dot]lever[at]oracle[dot]com