Re: AutoFS+NFSv4 server down = LOOOOONG timeout.

Carlos André <candrecn@xxxxxxxxx> · Wed, 12 Aug 2009 12:00:19 -0300

Hi Ian,
I'm going crazy trying to get "retry=" to work on mount... this option
just DOESN'T WORK with proto=tcp and/or Kerberos (sec=krb5/krb5i/krb5p),
as you can see in my previous emails...

I appreciate any help.

Carlos.

2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
> Chuck Lever wrote:
>> On Aug 11, 2009, at 8:41 AM, Carlos André wrote:
>>> This long timeout is good if a workstation needs to mount a critical
>>> directory from /etc/fstab at boot (for example).
>>> But in my case this loooong timeout doesn't make any sense,
>>> since autofs retries the mount on access. In fact it gives me
>>> a lot of headaches, because user login just hangs if one server goes
>>> down for any reason, and hangs again if the user tries to access a
>>> directory pointing to a down NFS server...
>>
>> "retry=0" means the mount command will fail as soon as the first
>> mount(2) system call fails.  When you set SYN retries to 1, this means
>> after 9 seconds, the connect fails, and that causes the mount(2) system
>> call to fail.
>>
>> Recent conversations with Ian suggested that a long timeout was desired
>> for automounter as well as other cases.  Ian, is there something else we
>> need to consider to determine the correct retry timeout for NFS/TCP
>> mount points handled via automounter?  How should mount.nfs wait so we
>> don't make other use cases worse?  (Looks like most of the history is
>> intact below).
>
> Of course we know that autofs is entirely at the mercy of mount(8) (and
> mount.nfs in particular). This has always been a difficult situation for
> the automounter because interactive mount invocations should wait. I
> believe automount mounts should always time out quickly, although that leads
> to its own set of problems, especially when home directories are concerned.
>
> I think adding "retry=0" is the right thing to do myself but I'm not
> certain that will work as we expect. I'll have to do some experimentation.
>
>>
>> How long do you think is appropriate for the automounter to wait if the
>> server is down, in your case, Carlos?
>>
>>> Am I missing something, or is something weird going on...!?
>>> ------------------------------------------------
>>> [root@KSTATION ~]# echo 5 > /proc/sys/net/ipv4/tcp_syn_retries  [DEFAULT]
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=1
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    3m9.000s
>>> user    0m0.002s
>>> sys     0m0.001s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=1
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    3m9.000s
>>> user    0m0.000s
>>> sys     0m0.002s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=0
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    3m9.001s
>>> user    0m0.000s
>>> sys     0m0.003s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=0
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    3m9.001s
>>> user    0m0.002s
>>> sys     0m0.001s
>>>
>>> [root@KSTATION ~]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries [ 5 to 1 ]
>>>
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=1
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 6]
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    1m3.002s
>>> user    0m0.000s
>>> sys     0m0.002s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=1
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    2m6.000s
>>> user    0m0.000s
>>> sys     0m0.002s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=0
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    0m9.003s
>>> user    0m0.001s
>>> sys     0m0.002s
>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=0
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>
>>> real    2m6.001s
>>> user    0m0.001s
>>> sys     0m0.002s
>>> [root@KSTATION ~]#
>>> ------------------------------------------------
>>> The max timeout drops to 2m6s when changing tcp_syn_retries from 5 to 1...
>>> and with retry=0 and no Kerberos I get only 9s...
>>>
>>> *sigh*
>>>
>>>
>>>
>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>> On Aug 10, 2009, at 4:05 PM, Carlos André wrote:
>>>>>
>>>>> Something funny: using the default tcp_syn_retries (5) I got
>>>>> "3,6,12,24,48,96" sec intervals... but if I change tcp_syn_retries to
>>>>> 1 I got "3,6,3,6,3,6..." sec intervals...
>>>>
>>>> Right.  Normally the RPC client calls the kernel's socket connect
>>>> function, which sends 6 SYNs (the initial attempt plus 5 retries).
>>>> That one call usually takes longer than the RPC client's connect
>>>> timeout, so it only makes one connect call, and then fails.
>>>>
>>>> Reducing the number of SYN retries per connect attempt causes the RPC
>>>> client to retry the connect call until its connect timeout expires.
>>>> Each connect call resets the SYN timeout to 3 seconds.
>>>>
>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o sec=krb5p,proto=tcp
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>
>>>>> real    3m9.000s
>>>>> user    0m0.000s
>>>>> sys     0m0.002s
>>>>>
>>>>> [root@KSERVER /]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o sec=krb5p,proto=tcp  ("retry=1" = no change)
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>
>>>>> real    2m6.004s
>>>>> user    0m0.000s
>>>>> sys     0m0.004s
>>>>>
>>>>> (3,6,3,6... secs interval)
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 2009/8/10 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>
>>>>>> No, I'm just using packages from the CentOS repo...
>>>>>>
>>>>>> And you're right about the exponential retries... I monitored the
>>>>>> traffic with tcpdump and saw SYN retries at 3, 6, 12, 24, 48, 96 sec
>>>>>> intervals on port 2049...
>>>>>> I tried the "retry=1" mount option without any change... I don't
>>>>>> want to change the source or the TCP timers... just the NFSv4 client.
>>>>>>
>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>
>>>>>>> On Aug 10, 2009, at 2:29 PM, Carlos André wrote:
>>>>>>>>
>>>>>>>> Bruce, no... you're right.  I'm describing a situation where my
>>>>>>>> server died... I need the mount to fail faster (10 or 15 secs max)
>>>>>>>> than 3 minutes and 9 seconds...
>>>>>>>
>>>>>>> The 189 second timeout is likely how long it takes the kernel to
>>>>>>> give up trying to connect a TCP socket to the server (6 SYN attempts
>>>>>>> with exponential retries, or something like that).  For stock CentOS
>>>>>>> 5.3, I think user space does only a DNS lookup for normal NFSv4
>>>>>>> mounts -- the kernel just tries to connect a TCP socket to port 2049,
>>>>>>> with no preceding rpcbind request.
>>>>>>>
>>>>>>> Carlos, let us know if you have replaced any NFS-related CentOS
>>>>>>> components (kernel, nfs-utils) with something you've built yourself.
>>>>>>>
>>>>>>>> 2009/8/7 J. Bruce Fields <bfields@xxxxxxxxxxxx>:
>>>>>>>>>
>>>>>>>>> On Fri, Aug 07, 2009 at 09:42:18AM +0300, Benny Halevy wrote:
>>>>>>>>>>
>>>>>>>>>> On Aug. 07, 2009, 3:18 +0300, Carlos André <candrecn@xxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Anyone ?
>>>>>>>>>>>
>>>>>>>>>>> 2009/7/29 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>>>
>>>>>>>>>>>> Folks, I need to get a CentOS 5.3 (updated) NFSv4 server working
>>>>>>>>>>>> with Kerberos and AutoFS, but I've got a problem: if the NFS server
>>>>>>>>>>>> goes down I get a LOOOOOOONG mount timeout on the CentOS 5.3
>>>>>>>>>>>> (updated) NFSv4 client...
>>>>>>>>>>>>
>>>>>>>>>>>> Since I need to mount some (3 to 6) dirs during the user logon
>>>>>>>>>>>> process, if a mount hangs, the user logon hangs. So I want to
>>>>>>>>>>>> configure it to time out (if the server is down) after 10-15 secs
>>>>>>>>>>>> (MAX) on each mount attempt.
>>>>>>>>>>>>
>>>>>>>>>>>> I've already set up a lab and tried a LOT of combinations; here are
>>>>>>>>>>>> my findings (server DOWN IP: 172.16.0.10 / client IP: 172.16.1.10)
>>>>>>>>>>>> using the basic command
>>>>>>>>>>>> (time mount 172.16.0.10:/remotedir /localdir/ -t nfs4 -o sec=krb5,proto=<tcp/udp>)
>>>>>>>>>>>> from the NFS client:
>>>>>>>>>>>>
>>>>>>>>>>>> - Once I try to access the mount point using AutoFS (proto=tcp OR
>>>>>>>>>>>> proto=udp) it hangs for 189 secs (3m9s: real  3m9.001s) until it
>>>>>>>>>>>> shows the error (mount: mount to NFS server '172.16.0.10' failed:
>>>>>>>>>>>> timed out (giving up))
>>>>>>>>>>
>>>>>>>>>> Sounds like you're hitting the server's grace period.
>>>>>>>>>
>>>>>>>>> I thought he was describing a situation where the server
>>>>>>>>> is completely gone and isn't coming back, and wondering how to make
>>>>>>>>> the mount fail faster.  But I may be misunderstanding.
>>>>>>>>>
>>>>>>>>> --b.
>>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>> linux-nfs" in
>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>> chuck[dot]lever[at]oracle[dot]com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>>> chuck[dot]lever[at]oracle[dot]com
>>>>
>>>>
>>>>
>>>>
>>
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>>
>>
>>
>
>
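
For reference, the timings quoted in this thread are consistent with simple
arithmetic on the SYN backoff. The sketch below is only a model, not code
from the kernel or from mount.nfs. It assumes a 3 second initial SYN timeout
that doubles on each retry (as seen in the tcpdump intervals quoted above),
and it takes the number of connect calls from counting the "retrying" and
"giving up" lines in the transcripts. The function names are hypothetical.

```python
def connect_timeout(syn_retries, initial_timeout=3):
    """Seconds until one TCP connect() gives up.

    Assumes the initial SYN plus `syn_retries` retransmissions, with the
    wait starting at `initial_timeout` seconds and doubling each time.
    """
    return sum(initial_timeout * 2 ** i for i in range(syn_retries + 1))


def mount_time(syn_retries, connect_calls):
    """Total wait if the connect is attempted `connect_calls` times in a row."""
    return connect_calls * connect_timeout(syn_retries)


# tcp_syn_retries=5: 3+6+12+24+48+96, one connect call
print(connect_timeout(5))   # 189 seconds = 3m9s
# tcp_syn_retries=1: 3+6, one connect call (retry=0 without Kerberos)
print(connect_timeout(1))   # 9 seconds
# tcp_syn_retries=1, 6 "retrying" lines plus the final "giving up"
print(mount_time(1, 7))     # 63 seconds = 1m3s
# tcp_syn_retries=1, 13 "retrying" lines plus the final "giving up"
print(mount_time(1, 14))    # 126 seconds = 2m6s
```

Under these assumptions the model reproduces the 3m9s, 9s, 1m3s and 2m6s
results, which suggests the remaining question is why the sec=krb5p case
keeps retrying the connect even with retry=0.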