Re: v3 client processes hung in rpc_wait_bit_killable

Update: we had another similar hang and this time I caught it fairly
quickly, so the "netstat -atu" output may be more useful.  See inline
below...

On 01/31/2013 03:58 PM, Nathan Dauchy - NOAA Affiliate wrote:
> Greetings,
> 
> On our Linux HPC cluster, one or more of the client nodes occasionally
> has a problem with all access to an NFS mount hanging.  Can anyone
> here point out a possible cause of the symptoms below or offer
> suggestions for other tools or debugging steps I should try?  Any new
> ideas or hints are appreciated!
> 
> The basic symptom is that all processes which access the NFS file
> system hang in "D" state.  Dumping the call trace shows that they are
> stuck in rpc_wait_bit_killable().  The processes can indeed be
> terminated with "kill -9".  If I kill all hung processes, I can umount
> the file system and remount it and then access works again.
> 
> We can run under "normal" user load for up to 3 days before 1-5 client
> nodes experience a similar problem.  When it happens, the hundreds of
> other clients show no sign of a problem with their NFS mounts.  There
> is no pattern to exactly which client will hang, though one complex
> user workflow seems to trigger it more than others.
> 
> Here is an example of the hung processes:
> 
> # ps -eo pid,stat,args,wchan | grep D
>    PID STAT COMMAND                     WCHAN
>   2192 Ds   -bash                       rpc_wait_bit_killable
>   2408 Ds   bash -c /home/ndauchy/bin/f rpc_wait_bit_killable
>  15926 D    df -P                       rpc_wait_bit_killable
>  36857 S+   grep D                      pipe_wait
> 
> By running "df" on each NFS mount one by one I was able to narrow down
> which mount was hung to this one:
> 
> # mount | grep home
> 10.178.17.21:/vol/home on /misc/whome type nfs
> (rw,relatime,vers=3,rsize=8192,wsize=8192,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.178.17.21,mountvers=3,mountport=4046,mountproto=udp,local_lock=all,addr=10.178.17.21)
> 
> There is no TCP connection to that server that I can see:
> 
> # netstat -atu
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address               Foreign Address    State
> tcp        0      0 *:ssh                       *:*               LISTEN
> tcp        0      0 s320-ib0:panasas            *:*               LISTEN
> tcp        0      0 s320:panasas                *:*               LISTEN
> tcp        0      0 *:15002                     *:*               LISTEN
> tcp        0      0 *:15003                     *:*               LISTEN
> tcp        0      0 localhost.localdomain:15005 *:*               LISTEN
> tcp        0      0 s320-ib0:45637              panmgmt5:10622    ESTABLISHED
> tcp        0      0 s320:ssh                    fe3:59631         ESTABLISHED
> tcp        0      0 s320:849                    wms1314-nfs:nfs   ESTABLISHED
> tcp        0      0 s320:ssh                    fe5:47363         ESTABLISHED
> tcp        0      0 *:ssh                       *:*               LISTEN
> udp        0      0 s320-ib0:ntp                *:*
> udp        0      0 s320:ntp                    *:*
> udp        0      0 localhost.localdomain:ntp   *:*
> udp        0      0 *:ntp                       *:*
> udp        0      0 *:52539                     *:*
> udp        0      0 fe80::21e:6703:3c:f00d:ntp  *:*
> udp        0      0 fe80::21e:67ff:fe3c:f008:ntp *:*
> udp        0      0 ::1:ntp                     *:*
> udp        0      0 *:ntp                       *:*

This time there is still no connection to the NFS server, but there are
many connections in the CLOSE_WAIT and FIN_WAIT states:

# netstat -atu
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address         State
tcp        0      0 *:ssh                       *:*                     LISTEN
tcp        0      0 n109-ib0:panasas            *:*                     LISTEN
tcp        0      0 n109:panasas                *:*                     LISTEN
tcp        0      0 *:15002                     *:*                     LISTEN
tcp        0      0 *:15003                     *:*                     LISTEN
tcp        0      0 n109:15003                  n98:975                 SYN_RECV
tcp        0      0 n109:15003                  n12:nnsp                SYN_RECV
tcp        0      0 n109:15003                  n7:timbuktu             SYN_RECV
tcp        0      0 n109:15003                  n56:netsc-prod          SYN_RECV
tcp        0      0 n109:15003                  n58:at-echo             SYN_RECV
tcp        0      0 n109:15003                  n116:815                SYN_RECV
tcp        0      0 n109:15003                  n73:1001                SYN_RECV
tcp        0      0 n109:15003                  n5:768                  SYN_RECV
tcp        0      0 n109:15003                  n83:276                 SYN_RECV
tcp        0      0 n109:15003                  n10:867                 SYN_RECV
tcp        0      0 n109:15003                  n23:csi-sgwp            SYN_RECV
tcp        0      0 n109:15003                  n48:ticf-1              SYN_RECV
tcp        0      0 localhost.localdomain:15005 *:*                     LISTEN
tcp        0     86 n109:932                    n131:15003              FIN_WAIT1
tcp       79      0 n109:15003                  n99:nest-protocol       CLOSE_WAIT
tcp       79      0 n109:15003                  n9:macon-tcp            CLOSE_WAIT
tcp       79      0 n109:15003                  n131:cvc_hostd          CLOSE_WAIT
tcp        0     86 n109:331                    n11:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n75:963                 CLOSE_WAIT
tcp       79      0 n109:15003                  n98:openport            CLOSE_WAIT
tcp       79      0 n109:15003                  n9:ipcserver            CLOSE_WAIT
tcp        1      0 n109:15003                  n93:979                 CLOSE_WAIT
tcp       79      0 n109:15003                  n126:820                CLOSE_WAIT
tcp       79      0 n109:15003                  n121:at-7               CLOSE_WAIT
tcp       79      0 n109:15003                  n45:sift-uft            CLOSE_WAIT
tcp       79      0 n109:15003                  n121:scc-security       CLOSE_WAIT
tcp       81      0 n109:15003                  n30:229                 CLOSE_WAIT
tcp        0     86 n109:utmpcd                 n16:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n95:241                 CLOSE_WAIT
tcp        0     89 n109:zserv                  n145:15003              FIN_WAIT1
tcp       79      0 n109:15003                  n116:878                CLOSE_WAIT
tcp       79      0 n109:15003                  n93:vmnet               CLOSE_WAIT
tcp       79      0 n109:15003                  n10:960                 CLOSE_WAIT
tcp       79      0 n109:15003                  n140:sntp-heartbeat     CLOSE_WAIT
tcp       79      0 n109:15003                  n45:nas                 CLOSE_WAIT
tcp        0     86 n109:rmc                    n55:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n103:ddm-rdb            CLOSE_WAIT
tcp       79      0 n109:15003                  n77:philips-vc          CLOSE_WAIT
tcp       79      0 n109:15003                  n85:883                 CLOSE_WAIT
tcp       81      0 n109:15003                  n52:con                 CLOSE_WAIT
tcp        0     86 n109:vacdsm-sws             n36:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n73:iiop                CLOSE_WAIT
tcp        0     88 n109:946                    n5:15003                FIN_WAIT1
tcp        0     86 n109:sgcp                   n108:15003              FIN_WAIT1
tcp       80      0 n109:15003                  n145:iris-lwz           CLOSE_WAIT
tcp       79      0 n109:15003                  n66:885                 CLOSE_WAIT
tcp       81      0 n109:15003                  n7:pawserv              CLOSE_WAIT
tcp       80      0 n109:15003                  n53:opalis-rdv          CLOSE_WAIT
tcp       81      0 n109:15003                  n115:hap                CLOSE_WAIT
tcp       79      0 n109:15003                  n65:remotefs            CLOSE_WAIT
tcp       81      0 n109:15003                  n118:914c/g             CLOSE_WAIT
tcp       79      0 n109:15003                  n11:sgcp                CLOSE_WAIT
tcp       79      0 n109:15003                  n131:mftp               CLOSE_WAIT
tcp       79      0 n109:15003                  n65:3com-amp3           CLOSE_WAIT
tcp       79      0 n109:15003                  n101:vatp               CLOSE_WAIT
tcp       79      0 n109:15003                  n66:mftp                CLOSE_WAIT
tcp       81      0 n109:15003                  n7:708                  CLOSE_WAIT
tcp       81      0 n109:15003                  n115:iso-ill            CLOSE_WAIT
tcp       80      0 n109:15003                  n47:864                 CLOSE_WAIT
tcp       79      0 n109:15003                  n41:uma                 CLOSE_WAIT
tcp       79      0 n109:15003                  n6:vnas                 CLOSE_WAIT
tcp       79      0 n109:15003                  n49:exp1                CLOSE_WAIT
tcp        0      0 n109-ib0:56970              panmgmt5:panasas        TIME_WAIT
tcp       79      0 n109:15003                  n36:732                 CLOSE_WAIT
tcp        0      0 n109:919                    n19:15003               FIN_WAIT2
tcp       81      0 n109:15003                  n5:corerjd              CLOSE_WAIT
tcp       80      0 n109:15003                  n53:at-5                CLOSE_WAIT
tcp       79      0 n109:15003                  n83:namp                CLOSE_WAIT
tcp        1      0 n109:ssh                    n145:36357              CLOSE_WAIT
tcp       79      0 n109:15003                  n127:917                CLOSE_WAIT
tcp       79      0 n109:15003                  n11:vpps-qua            CLOSE_WAIT
tcp       79      0 n109:15003                  n8:rlzdbase             CLOSE_WAIT
tcp       81      0 n109:15003                  n118:zserv              CLOSE_WAIT
tcp       79      0 n109:15003                  n49:telnets             CLOSE_WAIT
tcp       81      0 n109:15003                  n52:cadlock             CLOSE_WAIT
tcp       79      0 n109:15003                  n41:texar               CLOSE_WAIT
tcp       79      0 n109:15003                  n58:937                 CLOSE_WAIT
tcp        0     88 n109:hassle                 n43:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n47:entrust-kmsh        CLOSE_WAIT
tcp        0      1 n109:prm-nm                 n93:15003               SYN_SENT
tcp        0      0 n109:1017                   wms56-nfs:nfs           ESTABLISHED
tcp       79      0 n109:15003                  n23:asip-webadmin       CLOSE_WAIT
tcp       79      0 n109:15003                  n99:879                 CLOSE_WAIT
tcp       79      0 n109:15003                  n101:jargon             CLOSE_WAIT
tcp       79      0 n109:15003                  n62:snare               CLOSE_WAIT
tcp        0      0 n109:ssh                    fe5:42381               ESTABLISHED
tcp       79      0 n109:15003                  n6:iris-xpc             CLOSE_WAIT
tcp       79      0 n109:15003                  n16:931                 CLOSE_WAIT
tcp       79      0 n109:15003                  n103:urm                CLOSE_WAIT
tcp       79      0 n109:15003                  n49:mftp                CLOSE_WAIT
tcp       79      0 n109:15003                  n98:npmp-gui            CLOSE_WAIT
tcp        0     86 n109:981                    n48:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n36:telnets             CLOSE_WAIT
tcp       81      0 n109:15003                  n68:asa                 CLOSE_WAIT
tcp       79      0 n109:15003                  n11:npmp-gui            CLOSE_WAIT
tcp       81      0 n109:15003                  n5:813                  CLOSE_WAIT
tcp       79      0 n109:15003                  n116:avian              CLOSE_WAIT
tcp       79      0 n109:15003                  n127:new-rwho           CLOSE_WAIT
tcp       79      0 n109:15003                  n9:ulistproc            CLOSE_WAIT
tcp       79      0 n109:15003                  n75:k-block             CLOSE_WAIT
tcp       79      0 n109:15003                  n26:819                 CLOSE_WAIT
tcp       79      0 n109:15003                  n37:vpps-qua            CLOSE_WAIT
tcp       79      0 n109:15003                  n6:qrh                  CLOSE_WAIT
tcp        0     86 n109:bgmp                   n65:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n101:bnet               CLOSE_WAIT
tcp       79      0 n109:15003                  n62:unify               CLOSE_WAIT
tcp       79      0 n109:15003                  n140:microsoft-ds       CLOSE_WAIT
tcp       80      0 n109:15003                  n145:ocs_cmu            CLOSE_WAIT
tcp       79      0 n109:15003                  n81:decbsrv             CLOSE_WAIT
tcp       81      0 n109:15003                  n5:ddm-dfm              CLOSE_WAIT
tcp       79      0 n109:15003                  n81:decauth             CLOSE_WAIT
tcp       79      0 n109:15003                  n26:1002                CLOSE_WAIT
tcp       81      0 n109:15003                  n118:lanserver          CLOSE_WAIT
tcp       81      0 n109:15003                  n118:mac-srvr-admin     CLOSE_WAIT
tcp       79      0 n109:15003                  n8:852                  CLOSE_WAIT
tcp       79      0 n109:15003                  n41:hassle              CLOSE_WAIT
tcp       79      0 n109:15003                  n37:743                 CLOSE_WAIT
tcp       80      0 n109:15003                  n145:olsr               CLOSE_WAIT
tcp       81      0 n109:15003                  n120:bhfhs              CLOSE_WAIT
tcp       79      0 n109:15003                  n101:pkix-timestamp     CLOSE_WAIT
tcp       79      0 n109:15003                  n56:298                 CLOSE_WAIT
tcp       79      0 n109:15003                  n86:970                 CLOSE_WAIT
tcp       79      0 n109:15003                  n10:netconf-ssh         CLOSE_WAIT
tcp       81      0 n109:15003                  n122:netrcs             CLOSE_WAIT
tcp       79      0 n109:15003                  n95:34617               CLOSE_WAIT
tcp       23      0 n109:15002                  jetbqs3:818             CLOSE_WAIT
tcp       79      0 n109:15003                  n86:hp-collector        CLOSE_WAIT
tcp       79      0 n109:15003                  n62:815                 CLOSE_WAIT
tcp       81      0 n109:15003                  n122:sanity             CLOSE_WAIT
tcp       80      0 n109:15003                  n145:krb5_prop          CLOSE_WAIT
tcp       81      0 n109:15003                  n7:passgo               CLOSE_WAIT
tcp       79      0 n109:15003                  n81:839                 CLOSE_WAIT
tcp       79      0 n109:15003                  n16:939                 CLOSE_WAIT
tcp       81      0 n109:15003                  n5:338                  CLOSE_WAIT
tcp        0     86 n109:335                    n45:15003               FIN_WAIT1
tcp       52      0 n109:15002                  jetbqs3:845             ESTABLISHED
tcp       79      0 n109:15003                  n86:923                 CLOSE_WAIT
tcp        0      0 n109-ib0:58388              panmgmt4:panasas        TIME_WAIT
tcp       81      0 n109:15003                  n7:urm                  CLOSE_WAIT
tcp       79      0 n109:15003                  n106:matip-type-a       CLOSE_WAIT
tcp       79      0 n109:15003                  n48:928                 CLOSE_WAIT
tcp       79      0 n109:15003                  n56:utime               CLOSE_WAIT
tcp       79      0 n109:15003                  n11:nest-protocol       CLOSE_WAIT
tcp       79      0 n109:15003                  n65:imaps               CLOSE_WAIT
tcp       79      0 n109:15003                  n23:725                 CLOSE_WAIT
tcp       79      0 n109:15003                  n127:datasurfsrvsec     CLOSE_WAIT
tcp       79      0 n109:15003                  n37:quotad              CLOSE_WAIT
tcp       79      0 n109:15003                  n55:hello-port          CLOSE_WAIT
tcp       79      0 n109:15003                  n26:289                 CLOSE_WAIT
tcp       79      0 n109:15003                  n75:opalis-rdv          CLOSE_WAIT
tcp        0      0 n109:45677                  10.178.123.66:10622     ESTABLISHED
tcp       79      0 n109:15003                  n15:820                 CLOSE_WAIT
tcp       79      0 n109:15003                  n121:341                CLOSE_WAIT
tcp       80      0 n109:15003                  n53:spsc                CLOSE_WAIT
tcp        0     86 n109:hap                    n23:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n85:fln-spx             CLOSE_WAIT
tcp       79      0 n109:15003                  n127:decvms-sysmgt      CLOSE_WAIT
tcp       79      0 n109:15003                  n106:dn6-nlm-aud        CLOSE_WAIT
tcp       79      0 n109:15003                  n10:sco-websrvrmg3      CLOSE_WAIT
tcp       79      0 n109:15003                  n131:link               CLOSE_WAIT
tcp       79      0 n109:15003                  n6:818                  CLOSE_WAIT
tcp       79      0 n109:15003                  n126:863                CLOSE_WAIT
tcp       79      0 n109:15003                  n126:servstat           CLOSE_WAIT
tcp        0     87 n109:commerce               n53:15003               FIN_WAIT1
tcp       79      0 n109:15003                  n16:decauth             CLOSE_WAIT
tcp       79      0 n109:15003                  n41:xact-backup         CLOSE_WAIT
tcp        0      0 n109-ib0:60283              panmgmt6:panasas        TIME_WAIT
tcp       79      0 n109:15003                  n85:spsc                CLOSE_WAIT
tcp       79      0 n109:15003                  n75:banyan-rpc          CLOSE_WAIT
tcp        0      0 *:ssh                       *:*                     LISTEN
udp        0      0 *:60495                     *:*
udp        0      0 *:863                       *:*
udp        0      0 n109-ib0:ntp                *:*
udp        0      0 n109:ntp                    *:*
udp        0      0 localhost.localdomain:ntp   *:*
udp        0      0 *:ntp                       *:*
udp        0      0 fe80::230:48ff:fec9:3b5e:ntp *:*
udp        0      0 fe80::230:48c9:78dc:1:ntp   *:*
udp        0      0 ::1:ntp                     *:*
udp        0      0 *:ntp                       *:*

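For next time, here is the probe I intend to use to find the hung mount
more quickly than running "df" by hand.  This is just a sketch: it
assumes coreutils "timeout" is available (it is on our CentOS 6 image),
and it sends SIGKILL since these stuck tasks only wake on fatal signals:

```shell
#!/bin/sh
# Probe every NFS mount with a short, killable timeout, so a hung
# mount flags itself as HUNG instead of wedging the whole loop.
# Mount list is parsed from /proc/mounts (field 3 = fstype).
for m in $(awk '$3 == "nfs" {print $2}' /proc/mounts); do
    if timeout -s KILL 5 df -P "$m" >/dev/null 2>&1; then
        echo "OK   $m"
    else
        echo "HUNG $m"
    fi
done
```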


> 
> Here are the call traces:
> 
> # echo w > /proc/sysrq-trigger
> 
> SysRq : Show Blocked State
>   task                        PC stack   pid father
> bash          D 0000000000000000     0  2192      1 0x00000004
>  ffff88042db47b28 0000000000000082 0000000000000000 ffff880400000000
>  ffff88042db47ad8 ffff88042db47dd8 ffff88042db47b98 ffff88042db47b88
>  ffff880121f7d098 ffff88042db47fd8 000000000000fb88 ffff880121f7d098
> Call Trace:
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c4707>] nfs3_proc_getattr+0x47/0x90 [nfs]
>  [<ffffffffa00b23b3>] __nfs_revalidate_inode+0xe3/0x220 [nfs]
>  [<ffffffffa00b2696>] nfs_revalidate_inode+0x36/0x60 [nfs]
>  [<ffffffffa00b2776>] nfs_getattr+0x66/0x120 [nfs]
>  [<ffffffff81180571>] vfs_getattr+0x51/0x80
>  [<ffffffff81180600>] vfs_fstatat+0x60/0x80
>  [<ffffffff8118074b>] vfs_stat+0x1b/0x20
>  [<ffffffff81180774>] sys_newstat+0x24/0x50
>  [<ffffffff8117cba5>] ? fput+0x25/0x30
>  [<ffffffff811785cd>] ? filp_close+0x5d/0x90
>  [<ffffffff811786a5>] ? sys_close+0xa5/0x100
>  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> bash          D 0000000000000000     0  2408      1 0x00000000
>  ffff8802f1f4d898 0000000000000086 0000000000000082 0000000000000246
>  ffff8802f1f4d838 ffff8802f1f4d898 ffff880028043718 0000000300000001
>  ffff88043418b098 ffff8802f1f4dfd8 000000000000fb88 ffff88043418b098
> Call Trace:
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c3d4c>] nfs3_proc_access+0xbc/0x180 [nfs]
>  [<ffffffffa00ab168>] nfs_do_access+0x198/0x420 [nfs]
>  [<ffffffffa000e765>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
>  [<ffffffffa000d770>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
>  [<ffffffffa00ab498>] nfs_permission+0xa8/0x1e0 [nfs]
>  [<ffffffff8118995d>] __link_path_walk+0xad/0x1030
>  [<ffffffff8118ab6a>] path_walk+0x6a/0xe0
>  [<ffffffff8118ad3b>] do_path_lookup+0x5b/0xa0
>  [<ffffffff8117c780>] ? get_empty_filp+0xa0/0x180
>  [<ffffffff8118bc6b>] do_filp_open+0xfb/0xd60
>  [<ffffffffa00af26a>] ? nfs_file_read+0xca/0x130 [nfs]
>  [<ffffffff81213426>] ? security_prepare_creds+0x16/0x20
>  [<ffffffff810996ef>] ? prepare_creds+0x8f/0xb0
>  [<ffffffff81183943>] open_exec+0x33/0x100
>  [<ffffffff81183aff>] do_execve+0xef/0x340
>  [<ffffffff8127ea8a>] ? strncpy_from_user+0x4a/0x90
>  [<ffffffff810095ea>] sys_execve+0x4a/0x80
>  [<ffffffff8100b54a>] stub_execve+0x6a/0xc0
> df            D 0000000000000000     0 15926  15915 0x00000000
>  ffff8801898f5ae8 0000000000000082 ffff8801898f5a48 ffffffff81116967
>  ffff8801898f5a68 ffffffff81116a25 ffff88032d29cdc0 ffffffffa00207a0
>  ffff88042be765f8 ffff8801898f5fd8 000000000000fb88 ffff88042be765f8
> Call Trace:
>  [<ffffffff81116967>] ? mempool_free_slab+0x17/0x20
>  [<ffffffff81116a25>] ? mempool_free+0x95/0xa0
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffffa000b972>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
>  [<ffffffff814feaaf>] __wait_on_bit+0x5f/0x90
>  [<ffffffffa000b930>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
>  [<ffffffff814feb58>] out_of_line_wait_on_bit+0x78/0x90
>  [<ffffffff81092110>] ? wake_bit_function+0x0/0x50
>  [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
>  [<ffffffffa000beb5>] __rpc_execute+0xf5/0x350 [sunrpc]
>  [<ffffffff81091f97>] ? bit_waitqueue+0x17/0xd0
>  [<ffffffffa000c171>] rpc_execute+0x61/0xa0 [sunrpc]
>  [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
>  [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
>  [<ffffffff8118a018>] ? __link_path_walk+0x768/0x1030
>  [<ffffffffa00c396d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
>  [<ffffffffa00c4797>] nfs3_proc_statfs+0x47/0x90 [nfs]
>  [<ffffffffa00b4cc6>] nfs_statfs+0x66/0x1a0 [nfs]
>  [<ffffffff811abe64>] statfs_by_dentry+0x74/0xa0
>  [<ffffffff811abf9b>] vfs_statfs+0x1b/0xb0
>  [<ffffffff811ac260>] do_statfs_native+0x20/0xb0
>  [<ffffffff811ac3f2>] sys_statfs+0x72/0xb0
>  [<ffffffff81500a7e>] ? do_device_not_available+0xe/0x10
>  [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> 

New traces look fairly similar:

SysRq : Show Blocked State
  task                        PC stack   pid father
sshd          D 0000000000000000     0 23812   1786 0x00000004
 ffff88032c5af928 0000000000000082 0000000000000000 0000000000000003
 0000000000000001 0000000000000282 ffff88032c5af8d8 ffffffff81052223
 ffff88031b1cdab8 ffff88032c5affd8 000000000000fb88 ffff88031b1cdab8
Call Trace:
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b752>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814ead9f>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814eae48>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810909d0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000bc95>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000bf51>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c129d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c167c>] nfs3_proc_access+0xbc/0x180 [nfs]
 [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa00a91b9>] nfs_do_access+0x199/0x3c0 [nfs]
 [<ffffffffa000e575>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
 [<ffffffffa000d550>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
 [<ffffffffa00a9488>] nfs_permission+0xa8/0x1e0 [nfs]
 [<ffffffff8118497d>] __link_path_walk+0xad/0x1030
 [<ffffffff81185b8a>] path_walk+0x6a/0xe0
 [<ffffffff81185d5b>] do_path_lookup+0x5b/0xa0
 [<ffffffff81177a50>] ? get_empty_filp+0xa0/0x180
 [<ffffffff81186c8b>] do_filp_open+0xfb/0xd60
 [<ffffffff81193272>] ? alloc_fd+0x92/0x160
 [<ffffffff81173a39>] do_sys_open+0x69/0x140
 [<ffffffff81088a2a>] ? sys_setresuid+0x15a/0x180
 [<ffffffff81173b50>] sys_open+0x20/0x30
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
bash          D 0000000000000000     0 24454      1 0x00000000
 ffff8802e8163898 0000000000000082 0000000000000000 0000000000000282
 0000000000000082 0000000000000000 ffff8802e8163848 ffff8802e81638a8
 ffff8803306885f8 ffff8802e8163fd8 000000000000fb88 ffff8803306885f8
Call Trace:
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffffa000b752>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
 [<ffffffff814ead9f>] __wait_on_bit+0x5f/0x90
 [<ffffffffa000b710>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
 [<ffffffff814eae48>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff810909d0>] ? wake_bit_function+0x0/0x50
 [<ffffffffa0000480>] ? call_reserve+0x0/0x60 [sunrpc]
 [<ffffffffa000bc95>] __rpc_execute+0xf5/0x350 [sunrpc]
 [<ffffffff81090857>] ? bit_waitqueue+0x17/0xd0
 [<ffffffffa000bf51>] rpc_execute+0x61/0xa0 [sunrpc]
 [<ffffffffa0002385>] rpc_run_task+0x75/0x90 [sunrpc]
 [<ffffffffa00024a2>] rpc_call_sync+0x42/0x70 [sunrpc]
 [<ffffffffa00c129d>] nfs3_rpc_wrapper.clone.0+0x3d/0xd0 [nfs]
 [<ffffffffa00c167c>] nfs3_proc_access+0xbc/0x180 [nfs]
 [<ffffffff81195540>] ? mntput_no_expire+0x30/0x110
 [<ffffffffa00a91b9>] nfs_do_access+0x199/0x3c0 [nfs]
 [<ffffffffa000e575>] ? generic_lookup_cred+0x15/0x20 [sunrpc]
 [<ffffffffa000d550>] ? rpcauth_lookupcred+0x70/0xc0 [sunrpc]
 [<ffffffffa00a9488>] nfs_permission+0xa8/0x1e0 [nfs]
 [<ffffffff8118497d>] __link_path_walk+0xad/0x1030
 [<ffffffff81185b8a>] path_walk+0x6a/0xe0
 [<ffffffff81185d5b>] do_path_lookup+0x5b/0xa0
 [<ffffffff81177a50>] ? get_empty_filp+0xa0/0x180
 [<ffffffff81186c8b>] do_filp_open+0xfb/0xd60
 [<ffffffff8104338c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff8120cbe6>] ? security_prepare_creds+0x16/0x20
 [<ffffffff8117ea23>] open_exec+0x33/0x100
 [<ffffffff8117ebbf>] do_execve+0xcf/0x2c0
 [<ffffffff810095ea>] sys_execve+0x4a/0x80
 [<ffffffff8100b4ca>] stub_execve+0x6a/0xc0


Does any of that provide more clues?

Are there other diagnostic tools I should be using?

Anything else I should check before rebooting the client?
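Before rebooting I also plan to capture the per-mount RPC transport
counters from /proc/self/mountstats, which should show whether the
client's xprt still believes it has a connection.  A rough one-liner
(the awk is mine; mountstats itself has been in mainline since 2.6.17):

```shell
# For each NFS mount, print its "device ..." header line together with
# the following "xprt:" transport-statistics line from mountstats.
awk '/fstype nfs/ {dev = $0} /xprt:/ {print dev; print}' /proc/self/mountstats
```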


Thanks in advance,
Nathan


> 
> When I tried "echo 1 > /proc/sys/sunrpc/rpc_debug" I got MANY messages
> like the following, so I quickly set it back to 0 again.  I don't know
> how to tell whether the RPCs are for the hung mountpoint or for a
> different one on the client, but I did notice that the "xid" values are
> increasing while the "req" address does not change.
> 
> Jan 31 01:13:11 s320 kernel: RPC: 55738 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55743 reserved req ffff88042be25000
> xid 4041e127
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xprt_transmit(112)
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xmit complete
> Jan 31 01:13:13 s320 kernel: RPC: 55743 xid 4041e127 complete (112
> bytes received)
> Jan 31 01:13:13 s320 kernel: RPC: 55743 release request ffff88042be25000
> Jan 31 01:13:13 s320 kernel: RPC: 55744 reserved req ffff88042be25000
> xid 4141e127
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_prepare_transmit
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xprt_transmit(112)
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xmit complete
> Jan 31 01:13:13 s320 kernel: RPC: 55744 xid 4141e127 complete (112
> bytes received)
> Jan 31 01:13:13 s320 kernel: RPC: 55744 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55745 reserved req ffff88042be25000
> xid 4241e127
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_prepare_transmit
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xprt_transmit(116)
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xmit complete
> Jan 31 01:13:14 s320 kernel: RPC: 55745 xid 4241e127 complete (120
> bytes received)
> Jan 31 01:13:14 s320 kernel: RPC: 55745 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55746 reserved req ffff88042be25000
> xid 4341e127
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_prepare_transmit
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xprt_transmit(116)
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xmit complete
> Jan 31 01:13:14 s320 kernel: RPC: 55746 xid 4341e127 complete (120
> bytes received)
> Jan 31 01:13:14 s320 kernel: RPC: 55746 release request ffff88042be25000
> Jan 31 01:13:14 s320 kernel: RPC: 55747 reserved req ffff88042be25000
> xid 4441e127
> 
> 
> The clients are running CentOS-6.3, and we have tried both kernel
> 2.6.32-279.2.1.el6 and 2.6.32-279.19.1.el6, and are using
> nfs-utils-1.2.3-26.el6.  We never saw the problem on the same hardware
> when running CentOS-5.5 and kernel 2.6.18-194.11.4.el5.
> 
> The server in this example is a NetApp FAS3140, but mounts to other
> servers (primarily running CentOS-5.7) have exhibited similar
> problems.
> 
> Running "tcpdump" shows no NFS traffic to the server when running a
> new command which should access the NFS mount.
> 
> We do not have a firewall in the way, and I don't believe any routing
> problem is blocking network traffic, since both "ping" and
> "showmount -e" from the client to the server respond normally.
> 
> Trying to create a new mount to any mountpoint on the same server
> hangs as well.  The call trace shows it in __d_instantiate(), called
> from rpc_wait_bit_killable().  Creating a new mount to a different
> server works fine.
> 
> And finally, here are even more details that may be unique to our
> configuration, and might confuse the issue, but I wanted to list them
> here just in case they are relevant:
> 
> * These nodes are fully diskless, mounting their root file system via
> NFS, and have no swap.
> 
> * We see similar hangs of processes accessing the root file system, as
> indicated by the call trace generated by alt-sysrq-w from the console.
> Unfortunately without a working root file system I can't login to the
> nodes to get more debugging info, so have decided to concentrate on
> the /home mount described above.
> 
> * Our site has 5 clusters all running the exact same (diskless) OS,
> ranging from 250 to 850 nodes.  The problem is triggered most
> frequently on a system with 8 cores/node + 24 GB memory, and very
> rarely on one with 16 cores/node + 32 GB memory.  So it may have
> something to do with memory pressure or with the number of nodes used by a
> user's application at a fixed core count.  Of course I can't rule out
> differences in client NIC or cluster network capabilities impacting
> frequency either.
> 
> 
> Thanks for your time reading all this and any help you can offer!
> 
> -Nathan

