NFS client/sunrpc getting stuck on 2.6.36

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Still seeing all sorts of boxes fall over with 2.6.35 and 2.6.36 NFS.
Unfortunately, it doesn't happen all the time...only certain load
patterns seem to start it off.  Once it starts, I can't find a way to
make it recover without rebooting.

In this case, tcpdump and rpc/nfs debug logs show a repeating command,
but only 3-4 commands are being sent per second, even though all slots
are in use, there is a backlog, 300 processes in rpc_wait_bit_killable,
and the debug logs show that responses are coming back immediately. 
Weird congestion/backoff issue?

Over a few seconds, the only numbers changing from "nfsstat" are "calls",
"authrefresh", and "getattr".  Debug level 1 for nfs and rpc show this
repeating:

 NFS: permission(0:4c/5284877), mask=0x1, res=0
 RPC:  5778 reserved req ffff88012ae042d0 xid 64030e74
 RPC:  5778 xprt_prepare_transmit
 RPC:  5778 xprt_transmit(96)
 RPC:  5778 xmit complete
 RPC:  5778 xid 64030e74 complete (112 bytes received)
 RPC:  5778 release request ffff88012ae042d0
 NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
 NFS: permission(0:4c/3247737045), mask=0x24, res=0
 NFS: dentry_delete(tmp/t.sh, 8)
 NFS: dentry_delete(tmp/9040-16697.done, 8)
 NFS: dentry_delete(tmp/9040-29371.done, 8)

The only thing that changes between repeats are the pid and xid.

Debug level 255 for nfs and rpc:

NFS call  getattr
RPC:       new task initialized, procpid 15460
RPC:       allocated task ffff8800c09f7e00
RPC:  5321 __rpc_execute flags=0x80
RPC:  5321 call_start nfs3 proc GETATTR (sync)
RPC:  5321 call_reserve (status 0)
RPC:  5321 reserved req ffff88012ae042d0 xid 9e010e74
RPC:  5321 call_reserveresult (status 0)
RPC:  5321 call_refresh (status 0)
RPC:  5321 looking up UNIX cred
RPC:       looking up UNIX cred
RPC:  5321 refreshing UNIX cred ffff880011ec1300
RPC:  5321 call_refreshresult (status 0)
RPC:  5321 call_allocate (status 0)
RPC:  5321 allocated buffer of size 552 at ffff880129b1d000
RPC:  5321 call_bind (status 0)
RPC:  5321 call_connect xprt ffff88012a74d000 is connected
RPC:  5321 call_transmit (status 0)
RPC:  5321 xprt_prepare_transmit
RPC:  5321 rpc_xdr_encode (status 0)
RPC:  5321 marshaling UNIX cred ffff880011ec1300
RPC:  5321 using AUTH_UNIX cred ffff880011ec1300 to wrap rpc data
RPC:  5321 xprt_transmit(96)
RPC:       xs_tcp_send_request(96) = 96
RPC:  5321 xmit complete
RPC:  5321 sleep_on(queue "xprt_pending" time 4494861481)
RPC:       xs_tcp_data_ready...
RPC:       xs_tcp_data_recv started
RPC:       reading TCP record fragment of length 112
RPC:       reading XID (4 bytes)
RPC:       reading reply for XID 9e010e74
RPC:       reading CALL/REPLY flag (4 bytes)
RPC:       read reply XID 9e010e74
RPC:  5321 added to queue ffff88012a74d3f8 "xprt_pending"
RPC:  5321 setting alarm for 900 ms
RPC:       XID 9e010e74 read 104 bytes
RPC:       xprt = ffff88012a74d000, tcp_copied = 112, tcp_offset = 112, tcp_reclen = 112
RPC:  5321 xid 9e010e74 complete (112 bytes received)
RPC:  5321 __rpc_wake_up_task (now 4494861481)
RPC:  5321 disabling timer
RPC:  5321 removed from queue ffff88012a74d3f8 "xprt_pending"
RPC:       __rpc_wake_up_task done
RPC:       xs_tcp_data_recv done
RPC:       wake_up_next(ffff88012a74d2e8 "xprt_resend")
RPC:       wake_up_next(ffff88012a74d1d8 "xprt_sending")
RPC:  5321 call_status (status 112)
RPC:  5321 call_decode (status 112)
RPC:  5321 validating UNIX cred ffff880011ec1300
RPC:  5321 using AUTH_UNIX cred ffff880011ec1300 to unwrap rpc data
RPC:  5321 call_decode result 0
RPC:  5321 return 0, status 0
RPC:  5321 release task
RPC:       freeing buffer of size 552 at ffff880129b1d000
RPC:  5321 release request ffff88012ae042d0
RPC:       wake_up_next(ffff88012a74d508 "xprt_backlog")
RPC:       rpc_release_client(ffff88012016e400)
RPC:  5321 freeing task
NFS reply getattr: 0
NFS: nfs_update_inode(0:4c/3247737045 ct=1 info=0x7e7f)
NFS: (0:4c/3247737045) revalidation complete
NFS: nfs_lookup_revalidate(HWD/tmp) is valid
RPC:       looking up Generic cred
NFS: permission(0:4c/3247737045), mask=0x24, res=0
NFS: open dir(HWD/tmp)
RPC:       looking up Generic cred
NFS: readdir(HWD/tmp) starting at cookie 0
NFS: readdir_search_pagecache() searching for offset 0
NFS: find_dirent_page: searching page 0 for target 0
NFS: found cookie 4 at index 0
NFS: find_dirent_page: returns 0
NFS: readdir_search_pagecache: returns 0
NFS: nfs_do_filldir() filling starting @ cookie 4
NFS: dentry_delete(tmp/t.sh, 8)
NFS: dentry_delete(tmp/9040-16697.done, 8)
NFS: dentry_delete(tmp/9040-29371.done, 8)
NFS: nfs_do_filldir() filling ended @ cookie 512; returning = 0
NFS: readdir(HWD/tmp) returns 0
NFS: readdir(HWD/tmp) starting at cookie 5
NFS: readdir_search_pagecache() searching for cookie 512
NFS: find_dirent_page: searching page 0 for target 512
NFS: find_dirent: examining cookie 4
NFS: find_dirent: examining cookie 6
NFS: find_dirent: examining cookie 11
NFS: find_dirent: examining cookie 18
NFS: find_dirent: examining cookie 512
NFS: find_dirent_page: returns -523
NFS: readdir_search_pagecache: returns -523
NFS: readdir(HWD/tmp) returns 0
RPC:       looking up Generic cred
NFS: permission(0:4c/5284877), mask=0x1, res=0
NFS: revalidating (0:4c/3247737045)

900ms matches the probably-silly nfs mount settings we're currently using:

rw,hard,intr,tcp,timeo=9,retrans=3,rsize=8192,wsize=8192

Full kernel log here: http://0x.ca/sim/ref/2.6.36_stuck_nfs/

Simon-
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux