On Tue, Jun 25, 2013 at 10:29:00PM +0530, Shyam Kaushik wrote:
> Hi,
>
> I looked into this issue further & see the problem is this:
> # Client sends OPEN() of a file & as part of this the NFSD server sets up
>   a stateid & returns it to the client.
> # Client comes back immediately with OPEN_CONFIRM() presenting the
>   same stateid & the NFSD server replies that it's a bad stateid.
>
> Here are some pieces from tcpdump:
>
> NFS client comes with an OPEN of a file:
> 10.189253 10.0.27.163 172.31.240.116 NFS 326 V4 Call (Reply In 8264) OPEN DH:0xfe4b0254/index.html
>     share_access: OPEN4_SHARE_ACCESS_READ (1)
>     share_deny: OPEN4_SHARE_DENY_NONE (0)
>     clientid: 0xcb76ab5114000000
>     Claim Type: CLAIM_NULL (0)
>
> NFS server reply:
> 10.457226 172.31.240.116 10.0.27.163 NFS 498 V4 Reply (Call In 8057) OPEN StateID:0x50ee
>     Status: NFS4_OK (0)
>     StateID Hash: 0x50ee
>     seqid: 0x00000001
>     Data: cb76ab5114000000c4ca2540
>     1. = OPEN4_RESULT_CONFIRM
>     Delegation Type: OPEN_DELEGATE_NONE (0)
>
> Client comes back with OPEN_CONFIRM:
> 10.459343 10.0.27.163 172.31.240.116 NFS 238 V4 Call (Reply In 8465) OPEN_CONFIRM
>     StateID Hash: 0x50ee
>     seqid: 0x00000001
>     Data: cb76ab5114000000c4ca2540
>
> Server replies with bad-stateid:
> 10.733341 172.31.240.116 10.0.27.163 NFS 122 V4 Reply (Call In 8275) OPEN_CONFIRM Status: NFS4ERR_BAD_STATEID
>     Status: NFS4ERR_BAD_STATEID (10025)

That does look like a server bug.  But I have trouble seeing how that's
happening (or why you would be able to reproduce it and not me).

On the final OPEN_CONFIRM the server is either failing to find the
stateid, or it found it and thinks it's been closed or already confirmed.
This is a different case from the one below, where the server returns
BAD_STATEID on the write.

> This keeps happening & nothing really progresses.
>
> Any thoughts/ideas on how to debug this further are greatly appreciated.

Off hand all I can suggest is tracing through the code (worst case,
sprinkling some printk's around) to figure out exactly where the
bad_stateid is being returned from originally; a rough sketch of what I
mean is appended after the quoted logs below.

Or running a git bisect if you have a known-good kernel.

Or telling us anything particular about your workload, so someone else
could attempt to reproduce it.

Also it would be interesting if you could verify whether this is
reproducible on newer kernels.

--b.

>
> Thanks.
>
> --Shyam
>
> On Mon, Jun 24, 2013 at 10:43 PM, Shyam Kaushik <shyamnfs1@xxxxxxxxx> wrote:
> > Hi Folks,
> >
> > Need help regarding a strange NFS server issue on a 3.2 kernel.
> >
> > We are running an NFS server on Ubuntu precise with the 3.2.0-25-generic
> > #40-Ubuntu kernel.
> >
> > We have several NFS exports out of this server & multiple clients
> > running different versions of the Linux kernel consume these exports.
> > We use ext4 with a sync mount as the filesystem.
> >
> > We periodically see that all NFS activity comes to a standstill on all
> > NFS exports.  Enabling NFS debug shows numerous nfserr_bad_stateid
> > errors on almost all operations.  This makes all of the NFSD threads
> > consume all of the CPU on the server.
> >
> > Jun 24 01:50:42 srv007 kernel: [5753609.342457] nfsd_dispatch: vers 4 proc 1
> > Jun 24 01:50:42 srv007 kernel: [5753609.342457] nfsv4 compound op #1/7: 22 (OP_PUTFH)
> > Jun 24 01:50:42 srv007 kernel: [5753609.342467] nfsv4 compound op ffff880095744078 opcnt 3 #1: 22: status 0
> > Jun 24 01:50:42 srv007 kernel: [5753609.342472] nfsv4 compound op #2/3: 38 (OP_WRITE)
> > Jun 24 01:50:42 srv007 kernel: [5753609.342472] nfsd: fh_verify(36: 01070001 00d40001 00000000 ac63c188 0a4859a1 feb41e83)
> > Jun 24 01:50:42 srv007 kernel: [5753609.342484] renewing client (clientid 51ab76cb/00005fc9)
> > Jun 24 01:50:42 srv007 kernel: [5753609.342486] NFSD: nfsd4_write: couldn't process stateid!
> > Jun 24 01:50:42 srv007 kernel: [5753609.342529] nfsv4 compound op ffff880095744078 opcnt 3 #2: 38: status 10025
> > Jun 24 01:50:42 srv007 kernel: [5753609.342544] nfsv4 compound returned 10025
> >
> > Jun 24 01:50:42 srv007 kernel: [5753609.444116] nfsd_dispatch: vers 4 proc 1
> > Jun 24 01:50:42 srv007 kernel: [5753609.444122] nfsv4 compound op #1/3: 22 (OP_PUTFH)
> > Jun 24 01:50:42 srv007 kernel: [5753609.444125] nfsd: fh_verify(36: 01070001 00020001 00000000 eb3726ca c8497c28 911b4a8d)
> > Jun 24 01:50:42 srv007 kernel: [5753609.444134] nfsv4 compound op ffff880093436078 opcnt 3 #1: 22: status 0
> > Jun 24 01:50:42 srv007 kernel: [5753609.444136] nfsv4 compound op #2/3: 38 (OP_WRITE)
> > Jun 24 01:50:42 srv007 kernel: [5753609.446920] nfsd4_process_open2: stateid=(51ab76cb/0000000b/40259544/00000001)
> > Jun 24 01:50:42 srv007 kernel: [5753609.446925] nfsv4 compound op ffff880095027078 opcnt 7 #3: 18: status 0
> > Jun 24 01:50:42 srv007 kernel: [5753609.446929] renewing client (clientid 51ab76cb/00000022)
> > Jun 24 01:50:42 srv007 kernel: [5753609.446929] NFSD: nfsd4_write: couldn't process stateid!
> > Jun 24 01:50:42 srv007 kernel: [5753609.446929] nfsv4 compound op ffff880093436078 opcnt 3 #2: 38: status 10025
> > Jun 24 01:50:42 srv007 kernel: [5753609.446929] nfsv4 compound returned 10025
> >
> > Jun 24 01:50:42 srv007 kernel: [5753609.447162] nfsd_dispatch: vers 4 proc 1
> > Jun 24 01:50:42 srv007 kernel: [5753609.447163] nfsd: fh_verify(36: 01070001 00240001 00000000 a80fc170 1947ae6c 4fbf37b1)
> > Jun 24 01:50:42 srv007 kernel: [5753609.447163] NFSD: nfs4_preprocess_seqid_op: seqid=1 stateid = (51ab76cb/00004b96/40259528/00000001)
> > Jun 24 01:50:42 srv007 kernel: [5753609.447181] nfsv4 compound op #1/7: 22 (OP_PUTFH)
> > Jun 24 01:50:42 srv007 kernel: [5753609.447185] nfsd: fh_verify(28: 00070001 00020001 00000000 53c0b8df a948fcb9 475e2cba)
> > Jun 24 01:50:42 srv007 kernel: [5753609.447185] renewing client (clientid 51ab76cb/00004b96)
> > Jun 24 01:50:42 srv007 kernel: [5753609.447187] nfsv4 compound op ffff88000813f078 opcnt 2 #2: 20: status 10025
> > Jun 24 01:50:42 srv007 kernel: [5753609.447189] nfsv4 compound returned 10025
> >
> > NFSD stacks look like:
> > [<ffffffffa022e765>] nfs4_lock_state+0x15/0x40 [nfsd]
> > [<ffffffffa02234f4>] nfsd4_open+0xb4/0x440 [nfsd]
> > [<ffffffffa0221bc8>] nfsd4_proc_compound+0x518/0x6d0 [nfsd]
> > [<ffffffffa020fa0b>] nfsd_dispatch+0xeb/0x230 [nfsd]
> > [<ffffffffa0131d95>] svc_process_common+0x345/0x690 [sunrpc]
> > [<ffffffffa01321e2>] svc_process+0x102/0x150 [sunrpc]
> > [<ffffffffa020f0bd>] nfsd+0xbd/0x160 [nfsd]
> > [<ffffffff8108a59c>] kthread+0x8c/0xa0
> > [<ffffffff81667db4>] kernel_thread_helper+0x4/0x10
> > [<ffffffffffffffff>] 0xffffffffffffffff
> >
> > I couldn't capture the running thread exactly, but it appears that one
> > thread of the NFSD thread pool runs, detects a bad stateid & returns.
> >
> > Is this a known issue?  Any help on how to dig in further is greatly
> > appreciated.
> >
> > Thanks.
> >
> > --Shyam
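
To make the "sprinkle some printk's" suggestion above concrete, here is a
rough sketch (not a verbatim patch) of the sort of instrumentation that
could be dropped into fs/nfsd/nfs4state.c in the 3.2 tree.  The helper name
bad_stateid_trace() and the reason strings are invented for the example;
only dprintk(), nfserr_bad_stateid and the functions visible in the logs
above (nfsd4_write, nfs4_preprocess_seqid_op) come from the actual code, so
check the call sites against the source you are actually running.

/*
 * Illustrative sketch, intended to live inside fs/nfsd/nfs4state.c so
 * that the headers already included there provide dprintk() and
 * nfserr_bad_stateid.  The idea is to tag every place that returns
 * nfserr_bad_stateid with a reason, so dmesg shows whether the stateid
 * lookup failed outright or the stateid was found but then rejected
 * (e.g. as unconfirmed or already closed).
 */
static __be32 bad_stateid_trace(const char *caller, const char *reason)
{
	/*
	 * dprintk() only prints with nfsd debugging enabled (which the
	 * logs above show is already the case here); switch to printk()
	 * if you want the message unconditionally.
	 */
	dprintk("NFSD: %s: returning BAD_STATEID (%s)\n", caller, reason);
	return nfserr_bad_stateid;
}

/*
 * Then, at each existing "return nfserr_bad_stateid;" site -- e.g. in
 * the OPEN_CONFIRM path via nfs4_preprocess_seqid_op() and in whatever
 * nfsd4_write() uses to resolve its stateid -- replace the bare return
 * with something like:
 *
 *	return bad_stateid_trace(__func__, "stateid not found");
 * or
 *	return bad_stateid_trace(__func__, "stateid found but closed/unconfirmed");
 */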