Re: AutoFS+NFSv4 server down = LOOOOONG timeout.

Carlos André <candrecn@xxxxxxxxx> · Mon, 24 Aug 2009 15:07:10 -0300



Ian,
Thanks for Support/Help :)


2009/8/24 Ian Kent <ikent@xxxxxxxxxx>:
> Carlos André wrote:
>> Hi Ian,
>>
>> Thanks for the patch and sorry for the delay (I was expecting to receive
>> your reply on the bug tracker, not here) :)
>>
>> But this patch didn't work for me as expected...  :(
>>
>>
>> First I changed "#MOUNT_WAIT=-1" to "MOUNT_WAIT=10",
>> and later changed "10" to "2", with the same results...
>> (always restarting the service, of course :)
>>
>> Then I tried removing "sec=krb5p", and later removed "nfs4", but I got
>> the same results again.
>>
>> Or am I doing something wrong?
>>
>>
>> [root@KSTATION areas]# automount -V
>>
>> Linux automount version 5.0.1-0.rc2.131.bz517349.1
>> [...]
>>
>> [root@KSTATION areas]# time ls -la testdown
>> ls: testedown: No such file or directory
>>
>> real    3m9.006s
>> user    0m0.002s
>> sys     0m0.000s
>
> OK, that isn't behaving the way I expect, I'll have a look.
>
>>
>>
>> LOGGING:
>> -----------------------------------------
>> Aug 24 09:23:51 KSTATION automount[20803]: mount_mount: mount(nfs):
>> calling mount -t nfs4 -s -o rw,acl,sec=krb5p 1.2.3.4:/areas/testdown
>> /misc/areas/testdown
>> Aug 24 09:27:00 KSTATION automount[20803]: mount(nfs): nfs: mount
>> failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
>> Aug 24 09:27:00 KSTATION automount[20803]: ioctl_send_fail: token = 91
>> Aug 24 09:27:00 KSTATION automount[20803]: failed to mount /misc/areas/testdown
>> -----------------------------------------
>>
>>
>>
>>
>>
>> 2009/8/17 Ian Kent <ikent@xxxxxxxxxx>:
>>> On Thu, 2009-08-13 at 12:18 -0300, Carlos André wrote:
>>>> Filled bug report:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=517349
>>> Hi Carlos,
>>>
>>> I have a patched source rpm to add a mount wait parameter to autofs
>>> located at:
>>> http://people.redhat.com/~ikent/autofs-5.0.1-0.rc2.131.bz517349.1
>>>
>>> Could you build it and see if it works?
>>> I haven't tested it at all but it is fairly straightforward.
>>> It is still unclear if this is the right way to do this and what the
>>> consequences are in sending a term signal to mount. This mount request
>>> will likely be followed by other requests for the same mount causing an
>>> accumulation of mount(8) processes waiting for RPC timeouts before they
>>> can answer the TERM signal.
>>>
>>> Anyway, for information the patch included in the source rpm above is:
>>>
>>> autofs-5.0.4 - add mount wait parameter
>>>
>>> From: Ian Kent <raven@xxxxxxxxxx>
>>>
>>> Often delays when trying to mount from a server that is not responding
>>> for some reason are undesirable. To try and prevent these delays we
>>> provide a configuration setting to limit the time that we wait for
>>> our spawned mount(8) process to complete before sending it a SIGTERM
>>> signal. This patch adds a configuration parameter to allow us to
>>> request we limit the time we wait for mount(8) to complete before
>>> sending it a TERM signal.
>>> ---
>>>
>>>  daemon/spawn.c                 |    3 ++-
>>>  include/defaults.h             |    2 ++
>>>  lib/defaults.c                 |   13 +++++++++++++
>>>  man/auto.master.5.in           |    7 +++++++
>>>  redhat/autofs.sysconfig.in     |    9 +++++++++
>>>  samples/autofs.conf.default.in |    9 +++++++++
>>>  6 files changed, 42 insertions(+), 1 deletion(-)
>>>
>>>
>>> --- autofs-5.0.1.orig/daemon/spawn.c
>>> +++ autofs-5.0.1/daemon/spawn.c
>>> @@ -312,6 +312,7 @@ int spawn_mount(unsigned logopt, ...)
>>>        unsigned int options;
>>>        unsigned int retries = MTAB_LOCK_RETRIES;
>>>        int update_mtab = 1, ret, printed = 0;
>>> +       unsigned int wait = defaults_get_mount_wait();
>>>        char buf[PATH_MAX];
>>>
>>>        /* If we use mount locking we can't validate the location */
>>> @@ -353,7 +354,7 @@ int spawn_mount(unsigned logopt, ...)
>>>        va_end(arg);
>>>
>>>        while (retries--) {
>>> -               ret = do_spawn(logopt, -1, options, prog, (const char **) argv);
>>> +               ret = do_spawn(logopt, wait, options, prog, (const char **) argv);
>>>                if (ret & MTAB_NOTUPDATED) {
>>>                        struct timespec tm = {3, 0};
>>>
>>> --- autofs-5.0.1.orig/include/defaults.h
>>> +++ autofs-5.0.1/include/defaults.h
>>> @@ -24,6 +24,7 @@
>>>
>>>  #define DEFAULT_TIMEOUT                        600
>>>  #define DEFAULT_NEGATIVE_TIMEOUT       60
>>> +#define DEFAULT_MOUNT_WAIT             -1
>>>  #define DEFAULT_UMOUNT_WAIT            12
>>>  #define DEFAULT_BROWSE_MODE            1
>>>  #define DEFAULT_LOGGING                        0
>>> @@ -62,6 +63,7 @@ struct ldap_schema *defaults_get_schema(
>>>  struct ldap_searchdn *defaults_get_searchdns(void);
>>>  void defaults_free_searchdns(struct ldap_searchdn *);
>>>  unsigned int defaults_get_append_options(void);
>>> +unsigned int defaults_get_mount_wait(void);
>>>  unsigned int defaults_get_umount_wait(void);
>>>  const char *defaults_get_auth_conf_file(void);
>>>  unsigned int defaults_get_map_hash_table_size(void);
>>> --- autofs-5.0.1.orig/lib/defaults.c
>>> +++ autofs-5.0.1/lib/defaults.c
>>> @@ -45,6 +45,7 @@
>>>  #define ENV_NAME_VALUE_ATTR            "VALUE_ATTRIBUTE"
>>>
>>>  #define ENV_APPEND_OPTIONS             "APPEND_OPTIONS"
>>> +#define ENV_MOUNT_WAIT                 "MOUNT_WAIT"
>>>  #define ENV_UMOUNT_WAIT                        "UMOUNT_WAIT"
>>>  #define ENV_AUTH_CONF_FILE             "AUTH_CONF_FILE"
>>>
>>> @@ -323,6 +324,7 @@ unsigned int defaults_read_config(unsign
>>>                    check_set_config_value(key, ENV_NAME_ENTRY_ATTR, value, to_syslog) ||
>>>                    check_set_config_value(key, ENV_NAME_VALUE_ATTR, value, to_syslog) ||
>>>                    check_set_config_value(key, ENV_APPEND_OPTIONS, value, to_syslog) ||
>>> +                   check_set_config_value(key, ENV_MOUNT_WAIT, value, to_syslog) ||
>>>                    check_set_config_value(key, ENV_UMOUNT_WAIT, value, to_syslog) ||
>>>                    check_set_config_value(key, ENV_AUTH_CONF_FILE, value, to_syslog) ||
>>>                    check_set_config_value(key, ENV_MAP_HASH_TABLE_SIZE, value, to_syslog))
>>> @@ -652,6 +654,17 @@ unsigned int defaults_get_append_options
>>>        return res;
>>>  }
>>>
>>> +unsigned int defaults_get_mount_wait(void)
>>> +{
>>> +       long wait;
>>> +
>>> +       wait = get_env_number(ENV_MOUNT_WAIT);
>>> +       if (wait < 0)
>>> +               wait = DEFAULT_MOUNT_WAIT;
>>> +
>>> +       return (unsigned int) wait;
>>> +}
>>> +
>>>  unsigned int defaults_get_umount_wait(void)
>>>  {
>>>        long wait;
>>> --- autofs-5.0.1.orig/man/auto.master.5.in
>>> +++ autofs-5.0.1/man/auto.master.5.in
>>> @@ -175,6 +175,13 @@ Set the default timeout for caching fail
>>>  60). If the equivalent command line option is given it will override this
>>>  setting.
>>>  .TP
>>> +.B MOUNT_WAIT
>>> +Set the default time to wait for a response from a spawned mount(8)
>>> +before sending it a SIGTERM. Note that we still need to wait for the
>>> +RPC layer to timeout before the sub-process exits so this isn't ideal
>>> +but it is the best we can do. The default is to wait until mount(8)
>>> +returns without intervention.
>>> +.TP
>>>  .B UMOUNT_WAIT
>>>  Set the default time to wait for a response from a spawned umount(8)
>>>  before sending it a SIGTERM. Note that we still need to wait for the
>>> --- autofs-5.0.1.orig/redhat/autofs.sysconfig.in
>>> +++ autofs-5.0.1/redhat/autofs.sysconfig.in
>>> @@ -14,6 +14,15 @@ TIMEOUT=300
>>>  #
>>>  #NEGATIVE_TIMEOUT=60
>>>  #
>>> +# MOUNT_WAIT - time to wait for a response from mount(8).
>>> +#             Setting this timeout can cause problems when
>>> +#             mount would otherwise wait for a server that
>>> +#             is temporarily unavailable, such as when it's
>>> +#             restarting. The default of waiting for mount(8)
>>> +#             usually results in a wait of around 3 minutes.
>>> +#
>>> +#MOUNT_WAIT=-1
>>> +#
>>>  # UMOUNT_WAIT - time to wait for a response from umount(8).
>>>  #
>>>  #UMOUNT_WAIT=12
>>> --- autofs-5.0.1.orig/samples/autofs.conf.default.in
>>> +++ autofs-5.0.1/samples/autofs.conf.default.in
>>> @@ -14,6 +14,15 @@ TIMEOUT=300
>>>  #
>>>  #NEGATIVE_TIMEOUT=60
>>>  #
>>> +# MOUNT_WAIT - time to wait for a response from mount(8).
>>> +#             Setting this timeout can cause problems when
>>> +#             mount would otherwise wait for a server that
>>> +#             is temporarily unavailable, such as when it's
>>> +#             restarting. The default of waiting for mount(8)
>>> +#             usually results in a wait of around 3 minutes.
>>> +#
>>> +#MOUNT_WAIT=-1
>>> +#
>>>  # UMOUNT_WAIT - time to wait for a response from umount(8).
>>>  #
>>>  #UMOUNT_WAIT=12
>>>
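The behaviour the patch description asks for (wait a bounded time for a spawned mount(8), then send it SIGTERM) can be sketched roughly as below. This is only an illustration under stated assumptions, not the autofs do_spawn() code: the name run_with_wait and the 100 ms polling loop are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical sketch, NOT the autofs do_spawn() implementation.
 * Runs argv[0] with arguments argv, waits up to wait_secs seconds for
 * it to exit, then sends SIGTERM and reaps it.
 * Returns 1 if the child had to be signalled, 0 if it exited on its
 * own, -1 on fork/waitpid failure.
 */
int run_with_wait(char *const argv[], unsigned int wait_secs)
{
	struct timespec tick = {0, 100 * 1000 * 1000};	/* 100 ms poll */
	unsigned int polls = wait_secs * 10;
	int status;
	pid_t pid, ret;

	assert(argv != NULL && argv[0] != NULL);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* exec failed */
	}

	while (polls--) {
		ret = waitpid(pid, &status, WNOHANG);
		if (ret == pid)
			return 0;	/* finished within the wait time */
		if (ret < 0)
			return -1;
		nanosleep(&tick, NULL);
	}

	/* Wait time expired: ask the child to terminate, then reap it. */
	kill(pid, SIGTERM);
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return 1;
}
```

Note that the final waitpid() still blocks until the child actually exits. A process that cannot act on SIGTERM until its RPC timeouts have run out would keep the caller waiting just as long, which is the caveat raised in the mail above.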
>>>
>>>> Thanks!
>>>>
>>>> 2009/8/13 Carlos André <candrecn@xxxxxxxxx>:
>>>>> 2009/8/13 Ian Kent <ikent@xxxxxxxxxx>:
>>>>>> Carlos André wrote:
>>>>>>> Today (2009-08-12) I'm using:
>>>>>>> kernel-2.6.18-128.2.1.el5
>>>>>>> autofs-5.0.1-0.rc2.102.el5_3.1
>>>>>> Thanks,
>>>>>>
>>>>>> My mistake, the wait time I was referring to is used for umounts during
>>>>>> expires and is present in rev rc2.102.
>>>>>>
>>>>>> It shouldn't be hard to add this for mount as well.
>>>>>> Would you like me to put something together?
>>>>> Sure! That'll help me a lot (and other people too, for sure) :) Thanks :)
>>>>>
>>>>>> Probably would be good to test something out to see if we can make a
>>>>>> difference by killing mount after some configured timeout but, if
>>>>>> we make progress, probably the best way to deal with it is for you to
>>>>>> log a bug against rhel-5 so I can get it committed to the rhel package.
>>>>>> The possible issue is that I'm not sure if the RPC subsystem in the
>>>>>> above rhel kernel will respond well to process death with potential
>>>>>> outstanding requests. But we'll see.
>>>>> Ok, on my way :)
>>>>>
>>>>> Thanks a lot!
>>>>>
>>>>>>>
>>>>>>> Look my last test:
>>>>>>> --------------------------------------------------------------
>>>>>>> [root@KSTATION areas]# time ls testdown
>>>>>>> ls: testdown: No such file or directory
>>>>>>>
>>>>>>> real    3m9.025s
>>>>>>> user    0m0.000s
>>>>>>> sys     0m0.002s
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: sun_mount: parse(sun):
>>>>>>> mounting root /misc/areas, mountpoint testdown, what
>>>>>>> 1.2.3.4:/areas/testdown, fstype nfs4, options
>>>>>>> acl,sec=krb5p,proto=tcp,retry=0
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: do_mount:
>>>>>>> 1.2.3.4:/areas/testdown /misc/areas/testdown type nfs4 options
>>>>>>> acl,sec=krb5p,proto=tcp,retry=0 using module nfs4
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>>>>>> root=/misc/areas name=testdown what=1.2.3.4:/areas/testdown,
>>>>>>> fstype=nfs4, options=acl,sec=krb5p,proto=tcp,retry=0
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>>>>>> nfs options="acl,sec=krb5p,proto=tcp,retry=0", nosymlink=0, ro=0
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>>>>>> calling mkdir_path /misc/areas/testdown
>>>>>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>>>>>> calling mount -t nfs4 -s -o acl,sec=krb5p,proto=tcp,retry=0
>>>>>>> 1.2.3.4:/areas/testdown /misc/areas/testdown
>>>>>>> Aug 12 12:58:12 KSTATION automount[15471]: st_expire: state 1 path /misc
>>>>>>> Aug 12 12:58:12 KSTATION automount[15471]: expire_proc: exp_proc =
>>>>>>> 3078093712 path /misc
>>>>>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_proc_indirect: 2
>>>>>>> submounts remaining in /misc
>>>>>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: got thid
>>>>>>> 3078093712 path /misc stat 3
>>>>>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: sigchld:
>>>>>>> exp 3078093712 finished, switching from 2 to 1
>>>>>>> Aug 12 12:58:13 KSTATION automount[15471]: st_ready: st_ready(): state
>>>>>>> = 2 path /misc
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: st_expire: state 1 path /misc
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_proc: exp_proc =
>>>>>>> 3078093712 path /misc
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_proc_indirect: 2
>>>>>>> submounts remaining in /misc
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: got thid
>>>>>>> 3078093712 path /misc stat 3
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: sigchld:
>>>>>>> exp 3078093712 finished, switching from 2 to 1
>>>>>>> Aug 12 12:59:28 KSTATION automount[15471]: st_ready: st_ready(): state
>>>>>>> = 2 path /misc
>>>>>>> Aug 12 13:00:16 KSTATION automount[15471]: >> mount: mount to NFS
>>>>>>> server '1.2.3.4' failed: timed out (giving up).
>>>>>>> Aug 12 13:00:16 KSTATION automount[15471]: mount(nfs): nfs: mount
>>>>>>> failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
>>>>>>> Aug 12 13:00:16 KSTATION automount[15471]: send_fail: token = 17
>>>>>>> Aug 12 13:00:16 KSTATION automount[15471]: failed to mount /misc/areas/testdown
>>>>>>> Aug 12 13:00:43 KSTATION automount[15471]: st_expire: state 1 path /misc
>>>>>>> --------------------------------------------------------------
>>>>>>>
>>>>>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>>>>>> Carlos André wrote:
>>>>>>>>> Hi Ian,
>>>>>>>>> I'm going crazy trying to get "retry=" to work on mount... this option
>>>>>>>>> just DOESN'T WORK if I use proto=tcp and/OR kerberos (sec=krb5/krb5i/krb5p),
>>>>>>>>> as you can see in my previous emails...
>>>>>>>> Right, my mistake for not looking closely enough at your post.
>>>>>>>>
>>>>>>>> Maybe this is related to the same sort of problem we had with mount in
>>>>>>>> the past, before the options parsing went into the kernel, where calls to
>>>>>>>> other services, like portmapper (or rpcbind), were being made with different
>>>>>>>> timeout parameters before the RPC calls for mounting. That's just an
>>>>>>>> example as NFSv4 shouldn't be sensitive to portmapper anyway.
>>>>>>>>
>>>>>>>> But what version of autofs and kernel did you say you were using?
>>>>>>>>
>>>>>>>>> I appreciate any help.
>>>>>>>>>
>>>>>>>>> Carlos.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>>>>>>>> Chuck Lever wrote:
>>>>>>>>>>> On Aug 11, 2009, at 8:41 AM, Carlos André wrote:
>>>>>>>>>>>> This long timeout is good if a workstation needs to mount a critical
>>>>>>>>>>>> directory using /etc/fstab on boot (for example).
>>>>>>>>>>>> But in my case this loooong timeout doesn't make any sense,
>>>>>>>>>>>> since autofs retries mounting the directory on access. In fact it gives
>>>>>>>>>>>> me a lot of headaches, because user login will just hang if one server
>>>>>>>>>>>> goes down for any reason, and will hang again if the user tries to access
>>>>>>>>>>>> a directory pointing to a down NFS server...
>>>>>>>>>>> "retry=0" means the mount command will fail as soon as the first
>>>>>>>>>>> mount(2) system call fails.  When you set SYN retries to 1, this means
>>>>>>>>>>> after 9 seconds, the connect fails, and that causes the mount(2) system
>>>>>>>>>>> call to fail.
>>>>>>>>>>>
>>>>>>>>>>> Recent conversations with Ian suggested that a long timeout was desired
>>>>>>>>>>> for automounter as well as other cases.  Ian, is there something else we
>>>>>>>>>>> need to consider to determine the correct retry timeout for NFS/TCP
>>>>>>>>>>> mount points handled via automounter?  How should mount.nfs wait so we
>>>>>>>>>>> don't make other use cases worse?  (Looks like most of the history is
>>>>>>>>>>> intact below).
>>>>>>>>>> Of course we know that autofs is entirely at the mercy of mount(8) (and
>>>>>>>>>> mount.nfs in particular). This has always been a difficult situation for
>>>>>>>>>> the automounter because interactive mount invocations should wait. But I
>>>>>>>>>> believe automount mounts should always time out quickly, but that leads
>>>>>>>>>> to its own set of problems, especially when home directories are concerned.
>>>>>>>>>>
>>>>>>>>>> I think adding "retry=0" is the right thing to do myself but I'm not
>>>>>>>>>> certain that will work as we expect. I'll have to do some experimentation.
>>>>>>>>>>
>>>>>>>>>>> How long do you think is appropriate for the automounter to wait if the
>>>>>>>>>>> server is down, in your case, Carlos?
>>>>>>>>>>>
>>>>>>>>>>>> Am I missing something, or is there something weird going on...!?
>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>> [root@KSTATION ~]# echo 5 > /proc/sys/net/ipv4/tcp_syn_retries  [DEFAULT]
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> proto=tcp,retry=1
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    3m9.000s
>>>>>>>>>>>> user    0m0.002s
>>>>>>>>>>>> sys     0m0.001s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> sec=krb5p,proto=tcp,retry=1
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    3m9.000s
>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> proto=tcp,retry=0
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    3m9.001s
>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>> sys     0m0.003s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> sec=krb5p,proto=tcp,retry=0
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    3m9.001s
>>>>>>>>>>>> user    0m0.002s
>>>>>>>>>>>> sys     0m0.001s
>>>>>>>>>>>>
>>>>>>>>>>>> [root@KSTATION ~]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries [ 5 to 1 ]
>>>>>>>>>>>>
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> proto=tcp,retry=1
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 6]
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    1m3.002s
>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> sec=krb5p,proto=tcp,retry=1
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    2m6.000s
>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> proto=tcp,retry=0
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    0m9.003s
>>>>>>>>>>>> user    0m0.001s
>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o
>>>>>>>>>>>> sec=krb5p,proto=tcp,retry=0
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>
>>>>>>>>>>>> real    2m6.001s
>>>>>>>>>>>> user    0m0.001s
>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>> [root@KSTATION ~]#
>>>>>>>>>>>> ------------------------------------------------
>>>>>>>>>>>> The max timeout drops to 2m6s after changing tcp_syn_retries from 5 to 1...
>>>>>>>>>>>> and using retry=0 without kerberos I get only 9s...
>>>>>>>>>>>>
>>>>>>>>>>>> *sigh*
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>>>>>> On Aug 10, 2009, at 4:05 PM, Carlos André wrote:
>>>>>>>>>>>>>> Something funny: Using default tcp_syn_retries (5) i got
>>>>>>>>>>>>>> "3,6,12,24,48,96" secs interval... but if i change tcp_syn_retries to
>>>>>>>>>>>>>> 1 i got "3,6,3,6,3,6..." secs interval...
>>>>>>>>>>>>> Right.  Normally the RPC client calls the kernel's socket connect
>>>>>>>>>>>>> function, which does 6 SYN retries.  That one call usually takes
>>>>>>>>>>>>> longer than the RPC client's connect timeout, so it only makes one
>>>>>>>>>>>>> connect call, and then fails.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Reducing the number of SYN retries per connect attempt causes the RPC
>>>>>>>>>>>>> client to retry the connect call until its connect timeout expires.
>>>>>>>>>>>>> Each connect call resets the SYN timeout to 3 seconds.
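The "3,6,3,6..." pattern described above can be reproduced with a small model. This is a sketch only: the names syn_intervals and total_wait are invented for illustration, and it assumes a 3 second initial SYN timeout that doubles on each retransmission and restarts at 3 seconds on every new connect call, as the observed intervals suggest.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code. Fills out[] with the wait (in
 * seconds) before each SYN attempt times out, for connect_calls
 * back-to-back connect() calls. Assumes a 3 second initial timeout
 * that doubles per retransmission and restarts at 3 on every new
 * connect(). Returns the number of entries written (at most cap).
 */
size_t syn_intervals(unsigned int syn_retries, unsigned int connect_calls,
		     unsigned int *out, size_t cap)
{
	size_t n = 0;
	unsigned int call, i;

	assert(out != NULL);

	for (call = 0; call < connect_calls; call++) {
		unsigned int wait = 3;	/* reset on every connect() */

		/* one initial SYN plus syn_retries retransmissions */
		for (i = 0; i <= syn_retries && n < cap; i++) {
			out[n++] = wait;
			wait *= 2;
		}
	}
	return n;
}

/* Sum of the first n intervals: total seconds spent connecting. */
unsigned int total_wait(const unsigned int *intervals, size_t n)
{
	unsigned int total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += intervals[i];
	return total;
}
```

If each "timed out" message in the transcript below corresponds to one connect call (an assumption), then 14 calls with tcp_syn_retries set to 1 add up to 14 x 9 = 126 seconds, which matches the reported 2m6s.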
>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o
>>>>>>>>>>>>>> sec=krb5p,proto=tcp
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> real    3m9.000s
>>>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>>>> sys     0m0.002s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [root@KSERVER /]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
>>>>>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o
>>>>>>>>>>>>>> sec=krb5p,proto=tcp  ("retry=1" = no change)
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> real    2m6.004s
>>>>>>>>>>>>>> user    0m0.000s
>>>>>>>>>>>>>> sys     0m0.004s
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> (3,6,3,6... secs interval)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2009/8/10 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>>>>>> No, I'm just using packages from the CentOS repo...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And you're right about the exponential retries... with tcpdump I
>>>>>>>>>>>>>>> monitored the traffic and saw SYN retries at 3, 6, 12, 24, 48, 96 secs
>>>>>>>>>>>>>>> on port 2049...
>>>>>>>>>>>>>>> I tried using the "retry=1" option on mount without any change... I
>>>>>>>>>>>>>>> don't want to change source or TCP timers... just the NFSv4 client.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>>>>>>>>> On Aug 10, 2009, at 2:29 PM, Carlos André wrote:
>>>>>>>>>>>>>>>>> Bruce, no... you're right.  I'm describing a situation where my
>>>>>>>>>>>>>>>>> server died... I need mount to fail faster (10 or 15 secs max) than
>>>>>>>>>>>>>>>>> 3 minutes and 9 seconds...
>>>>>>>>>>>>>>>> The 189 second timeout is likely how long it takes the kernel to
>>>>>>>>>>>>>>>> give up trying to connect a TCP socket to the server (6 SYN attempts
>>>>>>>>>>>>>>>> with exponential retries, or something like that).  For stock CentOS
>>>>>>>>>>>>>>>> 5.3, I think user space does only a DNS lookup for normal NFSv4
>>>>>>>>>>>>>>>> mounts -- the kernel just tries to connect a TCP socket to port 2049,
>>>>>>>>>>>>>>>> with no preceding rpcbind request.
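The 189 second figure is consistent with a simple sum, assuming the first SYN times out after 3 seconds and every retransmission doubles the wait. That assumption comes from the 3, 6, 12, 24, 48, 96 second intervals reported elsewhere in this thread, and the helper name syn_connect_timeout is made up for this sketch.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the kernel or nfs-utils.
 * Assumes an initial SYN timeout of 3 seconds that doubles after each
 * retransmission. Returns the total number of seconds before a single
 * connect() attempt gives up.
 */
unsigned int syn_connect_timeout(unsigned int syn_retries)
{
	unsigned int wait = 3;	/* assumed initial timeout, in seconds */
	unsigned int total = 0;
	unsigned int i;

	assert(syn_retries < 16);	/* keep the doubling well in range */

	/* one initial SYN plus syn_retries retransmissions */
	for (i = 0; i <= syn_retries; i++) {
		total += wait;
		wait *= 2;
	}
	return total;
}
```

With the default tcp_syn_retries of 5 this gives 3+6+12+24+48+96 = 189 seconds (3m9s); with tcp_syn_retries set to 1 it gives 3+6 = 9 seconds.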
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Carlos, let us know if you have replaced any NFS-related CentOS
>>>>>>>>>>>>>>>> components (kernel, nfs-utils) with something you've built yourself.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> 2009/8/7 J. Bruce Fields <bfields@xxxxxxxxxxxx>:
>>>>>>>>>>>>>>>>>> On Fri, Aug 07, 2009 at 09:42:18AM +0300, Benny Halevy wrote:
>>>>>>>>>>>>>>>>>>> On Aug. 07, 2009, 3:18 +0300, Carlos André <candrecn@xxxxxxxxx>
>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>> Anyone ?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2009/7/29 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>>>>>>>>>>>> People, I need to get a CentOS 5.3 (updated) NFSv4 server working
>>>>>>>>>>>>>>>>>>>>> with Kerberos and AutoFS, but I have a problem: if the NFS server goes
>>>>>>>>>>>>>>>>>>>>> down I get a LOOOOOOONG mount timeout on the CentOS 5.3 (updated)
>>>>>>>>>>>>>>>>>>>>> NFSv4 client...
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Since I need to mount some (3 to 6) dirs during user logon, if mount
>>>>>>>>>>>>>>>>>>>>> hangs, user logon hangs. So I want to configure it to time out (if the
>>>>>>>>>>>>>>>>>>>>> server is down) after 10-15 secs (MAX) on each mount attempt.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I already set up a lab and tried a LOT of combinations; here are my
>>>>>>>>>>>>>>>>>>>>> findings (server DOWN IP: 172.16.0.10 / client IP: 172.16.1.10) using
>>>>>>>>>>>>>>>>>>>>> the basic command
>>>>>>>>>>>>>>>>>>>>> (time mount 172.16.0.10:/remotedir /localdir/ -t nfs4 -o
>>>>>>>>>>>>>>>>>>>>> sec=krb5,proto=<tcp/udp>) from the NFS client:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Once I try to access the mount point using AutoFS (proto=tcp OR
>>>>>>>>>>>>>>>>>>>>> proto=udp) it hangs for 189 secs (3m9s: real  3m9.001s) until it shows
>>>>>>>>>>>>>>>>>>>>> the error (mount: mount to NFS server '172.16.0.10' failed: timed out
>>>>>>>>>>>>>>>>>>>>> (giving up))
>>>>>>>>>>>>>>>>>>> Sounds like you're hitting the server's grace period.
>>>>>>>>>>>>>>>>>> I thought he was describing a situation where the server
>>>>>>>>>>>>>>>>>> is completely gone and isn't coming back, and wondering how to make
>>>>>>>>>>>>>>>>>> the mount fail faster.  But I may be misunderstanding.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --b.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>>>>>>> linux-nfs" in
>>>>>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>>>>>> chuck[dot]lever[at]oracle[dot]com
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>
>>>
>
>