Filed bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=517349

Thanks!

2009/8/13 Carlos André <candrecn@xxxxxxxxx>:
> 2009/8/13 Ian Kent <ikent@xxxxxxxxxx>:
>> Carlos André wrote:
>>> Today (2009-08-12) I'm using:
>>> kernel-2.6.18-128.2.1.el5
>>> autofs-5.0.1-0.rc2.102.el5_3.1
>>
>> Thanks,
>>
>> My mistake, the wait time I was referring to is used for umounts during
>> expires and is present in rev rc2.102.
>>
>> It shouldn't be hard to add this for mount as well.
>> Would you like me to put something together?
>
> Sure! that'll help me a lot (and for sure another ppl) :) Thanks :)
>
>>
>> Probably would be good to test something out to see if we can make a
>> difference with the killing mount after some configured timeout but, if
>> we make progress, probably the best way to deal with it is for you to
>> log a bug against rhel-5 so I can get it committed to the rhel package.
>> The possible issue is that I'm not sure if the RPC subsystem in the
>> above rhel kernel will respond well to process death with potential
>> outstanding requests. But we'll see.
>
> Ok, on my way :)
>
> Thanks a lot!
>
>>
>>>
>>> Look my last test:
>>> --------------------------------------------------------------
>>> [root@KSTATION areas]# time ls testdown
>>> ls: testdown: No such file or directory
>>>
>>> real 3m9.025s
>>> user 0m0.000s
>>> sys 0m0.002s
>>>
>>> Aug 12 12:57:07 KSTATION automount[15471]: sun_mount: parse(sun):
>>> mounting root /misc/areas, mountpoint testdown, what
>>> 1.2.3.4:/areas/testdown, fstype nfs4, options
>>> acl,sec=krb5p,proto=tcp,retry=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: do_mount:
>>> 1.2.3.4:/areas/testdown /misc/areas/testdown type nfs4 options
>>> acl,sec=krb5p,proto=tcp,retry=0 using module nfs4
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> root=/misc/areas name=testdown what=1.2.3.4:/areas/testdown,
>>> fstype=nfs4, options=acl,sec=krb5p,proto=tcp,retry=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> nfs options="acl,sec=krb5p,proto=tcp,retry=0", nosymlink=0, ro=0
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> calling mkdir_path /misc/areas/testdown
>>> Aug 12 12:57:07 KSTATION automount[15471]: mount_mount: mount(nfs):
>>> calling mount -t nfs4 -s -o acl,sec=krb5p,proto=tcp,retry=0
>>> 1.2.3.4:/areas/testdown /misc/areas/testdown
>>> Aug 12 12:58:12 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> Aug 12 12:58:12 KSTATION automount[15471]: expire_proc: exp_proc =
>>> 3078093712 path /misc
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_proc_indirect: 2
>>> submounts remaining in /misc
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: got thid
>>> 3078093712 path /misc stat 3
>>> Aug 12 12:58:13 KSTATION automount[15471]: expire_cleanup: sigchld:
>>> exp 3078093712 finished, switching from 2 to 1
>>> Aug 12 12:58:13 KSTATION automount[15471]: st_ready: st_ready(): state
>>> = 2 path /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> Aug 12 12:59:28 KSTATION
>>> automount[15471]: expire_proc: exp_proc =
>>> 3078093712 path /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_proc_indirect: 2
>>> submounts remaining in /misc
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: got thid
>>> 3078093712 path /misc stat 3
>>> Aug 12 12:59:28 KSTATION automount[15471]: expire_cleanup: sigchld:
>>> exp 3078093712 finished, switching from 2 to 1
>>> Aug 12 12:59:28 KSTATION automount[15471]: st_ready: st_ready(): state
>>> = 2 path /misc
>>> Aug 12 13:00:16 KSTATION automount[15471]: >> mount: mount to NFS
>>> server '1.2.3.4' failed: timed out (giving up).
>>> Aug 12 13:00:16 KSTATION automount[15471]: mount(nfs): nfs: mount
>>> failure 1.2.3.4:/areas/testdown on /misc/areas/testdown
>>> Aug 12 13:00:16 KSTATION automount[15471]: send_fail: token = 17
>>> Aug 12 13:00:16 KSTATION automount[15471]: failed to mount /misc/areas/testdown
>>> Aug 12 13:00:43 KSTATION automount[15471]: st_expire: state 1 path /misc
>>> --------------------------------------------------------------
>>>
>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>> Carlos André wrote:
>>>>> Hi Ian,
>>>>> I'm getting crazy trying put "retry=" to work on mount... this option
>>>>> just DONT WORK if use proto=tcp and/OR kerberos (sec=krb5/krb5i/krb5p)
>>>>> like you can see on my previous emails...
>>>> Right, my mistake for not looking closely enough at post.
>>>>
>>>> Maybe this is related to the same sort of problem we had with mount in
>>>> the past, before the options parsing went into the kernel, where other
>>>> services, like portmapper (or rpcbind), were being done with different
>>>> timeout parameters before the RPC calls for mounting. That's just an
>>>> example as NFSv4 shouldn't be sensitive to portmapper anyway.
>>>>
>>>> But what version of autofs and kernel did you say you were using?
>>>>
>>>>> I appreciate any help.
>>>>>
>>>>> Carlos.
>>>>>
>>>>> 2009/8/12 Ian Kent <ikent@xxxxxxxxxx>:
>>>>>> Chuck Lever wrote:
>>>>>>> On Aug 11, 2009, at 8:41 AM, Carlos André wrote:
>>>>>>>> This long timeout is good if workstation need mount a critical
>>>>>>>> directory using /etc/fstab on boot (for example)..
>>>>>>>> But in my case, using this loooong timeout doesnt make any sense,
>>>>>>>> since autofs retry mount directory on-access. This in fact gives me
>>>>>>>> alot of headaches, coz user login 'll just hangs if one server goes
>>>>>>>> down for any reason, and will again hangs if user try access directory
>>>>>>>> pointing to a NFS down server...
>>>>>>> "retry=0" means the mount command will fail as soon as the first
>>>>>>> mount(2) system call fails. When you set SYN retries to 1, this means
>>>>>>> after 9 seconds, the connect fails, and that causes the mount(2) system
>>>>>>> call to fail.
>>>>>>>
>>>>>>> Recent conversations with Ian suggested that a long timeout was desired
>>>>>>> for automounter as well as other cases. Ian, is there something else we
>>>>>>> need to consider to determine the correct retry timeout for NFS/TCP
>>>>>>> mount points handled via automounter? How should mount.nfs wait so we
>>>>>>> don't make other use cases worse? (Looks like most of the history is
>>>>>>> intact below).
>>>>>> Of course we know that autofs is entirely at the mercy of mount(8) (and
>>>>>> mount.nfs in particular). This has always been a difficult situation for
>>>>>> the automounter because interactive mount invocations should wait. But I
>>>>>> believe automount mounts should always time out quickly, but that leads
>>>>>> to its own set of problems, especially when home directories are concerned.
>>>>>>
>>>>>> I think adding "retry=0" is the right thing to do myself but I'm not
>>>>>> certain that will work as we expect. I'll have to do some experimentation.
>>>>>>
>>>>>>> How long do you think is appropriate for the automounter to wait if the
>>>>>>> server is down, in your case, Carlos?
>>>>>>>
>>>>>>>> Am losing something or there have was something weirdo...!?
>>>>>>>> ------------------------------------------------
>>>>>>>> [root@KSTATION ~]# echo 5 > /proc/sys/net/ipv4/tcp_syn_retries [DEFAULT]
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 3m9.000s
>>>>>>>> user 0m0.002s
>>>>>>>> sys 0m0.001s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 3m9.000s
>>>>>>>> user 0m0.000s
>>>>>>>> sys 0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 3m9.001s
>>>>>>>> user 0m0.000s
>>>>>>>> sys 0m0.003s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 3m9.001s
>>>>>>>> user 0m0.002s
>>>>>>>> sys 0m0.001s
>>>>>>>>
>>>>>>>> [root@KSTATION ~]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries [ 5 to 1 ]
>>>>>>>>
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 6]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 1m3.002s
>>>>>>>> user 0m0.000s
>>>>>>>> sys 0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=1
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 2m6.000s
>>>>>>>> user 0m0.000s
>>>>>>>> sys 0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 0m9.003s
>>>>>>>> user 0m0.001s
>>>>>>>> sys 0m0.002s
>>>>>>>> [root@KSTATION ~]# time mount 1.2.3.4:/blabla /tmp/ -t nfs4 -o sec=krb5p,proto=tcp,retry=0
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying). [x 13]
>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>
>>>>>>>> real 2m6.001s
>>>>>>>> user 0m0.001s
>>>>>>>> sys 0m0.002s
>>>>>>>> [root@KSTATION ~]#
>>>>>>>> ------------------------------------------------
>>>>>>>> max timeout goes to 2m6s changing tcp_syn_retries from 5 to 1... and
>>>>>>>> using retry=0 without kerberos I got only 9s...
>>>>>>>>
>>>>>>>> *sigh*
>>>>>>>>
>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>> On Aug 10, 2009, at 4:05 PM, Carlos André wrote:
>>>>>>>>>> Something funny: Using default tcp_syn_retries (5) i got
>>>>>>>>>> "3,6,12,24,48,96" secs interval... but if i change tcp_syn_retries to
>>>>>>>>>> 1 i got "3,6,3,6,3,6..." secs interval...
>>>>>>>>> Right. Normally the RPC client calls the kernel's socket connect function,
>>>>>>>>> which does 6 SYN retries. That one call usually takes longer than the RPC
>>>>>>>>> client's connect timeout, so it only makes one connect call, and then
>>>>>>>>> fails.
>>>>>>>>>
>>>>>>>>> Reducing the number of SYN retries per connect attempt causes the RPC client
>>>>>>>>> to retry the connect call until its connect timeout expires. Each connect
>>>>>>>>> call resets the SYN timeout to 3 seconds.
>>>>>>>>>
>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o sec=krb5p,proto=tcp
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>
>>>>>>>>>> real 3m9.000s
>>>>>>>>>> user 0m0.000s
>>>>>>>>>> sys 0m0.002s
>>>>>>>>>>
>>>>>>>>>> [root@KSERVER /]# echo 1 > /proc/sys/net/ipv4/tcp_syn_retries
>>>>>>>>>> [root@KSERVER mnt]# time mount 1.2.3.4:/blabla tmp/ -t nfs4 -o sec=krb5p,proto=tcp ("retry=1" = no change)
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (retrying).
>>>>>>>>>> mount: mount to NFS server '1.2.3.4' failed: timed out (giving up).
>>>>>>>>>>
>>>>>>>>>> real 2m6.004s
>>>>>>>>>> user 0m0.000s
>>>>>>>>>> sys 0m0.004s
>>>>>>>>>>
>>>>>>>>>> (3,6,3,6... secs interval)
>>>>>>>>>>
>>>>>>>>>> 2009/8/10 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>> No, i'm just using packages from CentOS repo...
>>>>>>>>>>>
>>>>>>>>>>> And u're right about expo retries... with tcpdump i've monitored
>>>>>>>>>>> traffic and i got SYN retries in 3, 6, 12, 24, 48, 96 secs on port
>>>>>>>>>>> 2049...
>>>>>>>>>>> I tried use "retry=1" option on mount without any change... I dont
>>>>>>>>>>> want change source or tcp timers...
>>>>>>>>>>> just NFSv4 client.
>>>>>>>>>>>
>>>>>>>>>>> 2009/8/10 Chuck Lever <chuck.lever@xxxxxxxxxx>:
>>>>>>>>>>>> On Aug 10, 2009, at 2:29 PM, Carlos André wrote:
>>>>>>>>>>>>> Bruce, no... you're right. I'm describing a situation where my server
>>>>>>>>>>>>> died... i need mount fail faster (10 or 15 secs max) than 3 minutes
>>>>>>>>>>>>> and 9 seconds...
>>>>>>>>>>>> The 189 second timeout is likely how long it takes the kernel to give up
>>>>>>>>>>>> trying to connect a TCP socket to the server (6 SYN attempts with
>>>>>>>>>>>> exponential retries, or something like that). For stock CentOS 5.3, I think
>>>>>>>>>>>> user space does only a DNS lookup for normal NFSv4 mounts -- the kernel just
>>>>>>>>>>>> tries to connect a TCP socket to port 2049, with no preceding rpcbind
>>>>>>>>>>>> request.
>>>>>>>>>>>>
>>>>>>>>>>>> Carlos, let us know if you have replaced any NFS-related CentOS components
>>>>>>>>>>>> (kernel, nfs-utils) with something you've built yourself.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2009/8/7 J. Bruce Fields <bfields@xxxxxxxxxxxx>:
>>>>>>>>>>>>>> On Fri, Aug 07, 2009 at 09:42:18AM +0300, Benny Halevy wrote:
>>>>>>>>>>>>>>> On Aug. 07, 2009, 3:18 +0300, Carlos André <candrecn@xxxxxxxxx> wrote:
>>>>>>>>>>>>>>>> Anyone ?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2009/7/29 Carlos André <candrecn@xxxxxxxxx>:
>>>>>>>>>>>>>>>>> PPL, I need put a CentOS 5.3 (updated) NFSv4 server to work with Kerberos
>>>>>>>>>>>>>>>>> and AutoFS, but i got a problem: If NFS server goes down i get a LOOOOOOONG
>>>>>>>>>>>>>>>>> mount timeout on CentOS 5.3 (updated) NFSv4 client...
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since i need mount some (3 to 6) dirs at user logon process, if mount hangs,
>>>>>>>>>>>>>>>>> user logon hangs.
>>>>>>>>>>>>>>>>> Then i want configure it to timeout (if server down) after
>>>>>>>>>>>>>>>>> 10-15 secs (MAX) on each mount attempt.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I already make a lab and tried a LOT of combinations, there my findings
>>>>>>>>>>>>>>>>> (server DOWN IP: 172.16.0.10 / client IP: 172.16.1.10) using basic command
>>>>>>>>>>>>>>>>> (time mount 172.16.0.10:/remotedir /localdir/ -t nfs4 -o
>>>>>>>>>>>>>>>>> sec=krb5,proto=<tcp/udp>) from NFS client:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> - Once i try access mount point using AutoFS (proto=tcp OR proto=udp) it
>>>>>>>>>>>>>>>>> hangs for 189 secs (3m9s: real 3m9.001s) until show error (mount: mount to
>>>>>>>>>>>>>>>>> NFS server '172.16.0.10' failed: timed out (giving up))
>>>>>>>>>>>>>>> Sounds like you're hitting the server's grace period.
>>>>>>>>>>>>>> I thought he was describing a situation where the server
>>>>>>>>>>>>>> is completely gone and isn't coming back, and wondering how to make the
>>>>>>>>>>>>>> mount fail faster. But I may be misunderstanding.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --b.
>>>>>>>>>>>> --
>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>> chuck[dot]lever[at]oracle[dot]com
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
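
For reference, the timings reported in this thread line up with simple arithmetic on the SYN backoff Chuck describes. The sketch below is only a model of the observed numbers, not kernel or mount.nfs code. It assumes an initial SYN timeout of 3 seconds that doubles on each retransmission (the intervals Carlos saw in tcpdump), and the function name connect_timeout_secs is made up for illustration.

```python
def connect_timeout_secs(syn_retries, initial_timeout=3):
    """Seconds one TCP connect() attempt blocks before giving up.

    Assumption: one initial SYN plus `syn_retries` retransmissions, with
    the wait doubling each time (3, 6, 12, ... as seen in tcpdump).
    """
    return sum(initial_timeout * 2 ** i for i in range(syn_retries + 1))


# tcp_syn_retries=5 (the default in this thread): 3+6+12+24+48+96
print(connect_timeout_secs(5))       # 189, i.e. the 3m9s mount timeout

# tcp_syn_retries=1: 3+6
print(connect_timeout_secs(1))       # 9, i.e. the 0m9s result with retry=0

# 13 "retrying" messages plus the final "giving up" would be 14 attempts
print(14 * connect_timeout_secs(1))  # 126, i.e. the 2m6s results

# 6 "retrying" messages plus "giving up" would be 7 attempts
print(7 * connect_timeout_secs(1))   # 63, i.e. the 1m3s result
```

Read this way, lowering tcp_syn_retries shortens each connect attempt, but the total wait still depends on how many times the caller retries the connect, which is why the Kerberos mounts stayed at 2m6s even with retry=0.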