On 2019/3/1 7:56 AM, Trond Myklebust wrote:
> On Thu, 2019-02-28 at 17:26 -0500, Olga Kornievskaia wrote:
>> On Thu, Feb 28, 2019 at 5:11 AM Jiufei Xue <jiufei.xue@xxxxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> when I tested xfstests/generic/323 with NFSv4.1 and v4.2, the task
>>> changed to zombie occasionally while a thread is hanging with the
>>> following stack:
>>>
>>> [<0>] rpc_wait_bit_killable+0x1e/0xa0 [sunrpc]
>>> [<0>] nfs4_do_close+0x21b/0x2c0 [nfsv4]
>>> [<0>] __put_nfs_open_context+0xa2/0x110 [nfs]
>>> [<0>] nfs_file_release+0x35/0x50 [nfs]
>>> [<0>] __fput+0xa2/0x1c0
>>> [<0>] task_work_run+0x82/0xa0
>>> [<0>] do_exit+0x2ac/0xc20
>>> [<0>] do_group_exit+0x39/0xa0
>>> [<0>] get_signal+0x1ce/0x5d0
>>> [<0>] do_signal+0x36/0x620
>>> [<0>] exit_to_usermode_loop+0x5e/0xc2
>>> [<0>] do_syscall_64+0x16c/0x190
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [<0>] 0xffffffffffffffff
>>>
>>> Since commit 12f275cdd163(NFSv4: Retry CLOSE and DELEGRETURN on
>>> NFS4ERR_OLD_STATEID), the client will retry to close the file when
>>> stateid generation number in client is lower than server.
>>>
>>> The original intention of this commit is retrying the operation
>>> while racing with an OPEN. However, in this case the stateid
>>> generation remains mismatch forever.
>>>
>>> Any suggestions?
>>
>> Can you include a network trace of the failure? Is it possible that
>> the server has crashed on reply to the close and that's why the task
>> is hung? What server are you testing against?
>>
>> I have seen trace where close would get ERR_OLD_STATEID and would
>> still retry with the same open state until it got a reply to the OPEN
>> which changed the state and when the client received reply to that,
>> it'll retry the CLOSE with the updated stateid.
>
> I agree with Olga's assessment. The server is not allowed to randomly
> change the values of the seqid, and the client should be taking pains
> to replay any OPEN calls for which a reply is missed. The expectation
> is therefore that NFS4ERR_OLD_STATEID should always be a temporary
> state.
>

The server bumped the seqid because of a new OPEN from another thread.
I suspect that the new OPEN task exited on receiving a signal without
updating the stateid.

> If it is not, then the bugreport needs to explain why the server bumped
> the seqid without informing the client.
>
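
To make the sequence I suspect concrete, here is a toy Python model of
it. This is not kernel code: the Server and Client classes, the
simulate() helper and the max_retries cap are all made up for
illustration (the real client waits and retries rather than counting
attempts). Only the error value 10024 for NFS4ERR_OLD_STATEID comes from
the protocol. The model assumes a second thread's OPEN bumps the seqid
on the server, and that the CLOSE can only succeed once the client has
processed that OPEN reply.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NFS4_OK = 0
NFS4ERR_OLD_STATEID = 10024  # protocol error value


@dataclass
class Server:
    """Hypothetical server tracking one open stateid generation."""
    seqid: int = 1

    def open(self) -> int:
        # A new OPEN on the same open state bumps the seqid.
        self.seqid += 1
        return self.seqid

    def close(self, client_seqid: int) -> int:
        # CLOSE with a seqid older than the server's is rejected.
        if client_seqid < self.seqid:
            return NFS4ERR_OLD_STATEID
        return NFS4_OK


@dataclass
class Client:
    """Hypothetical client view of the same stateid."""
    seqid: int = 1
    pending_open_seqid: Optional[int] = None  # OPEN reply not yet processed

    def apply_pending_open_reply(self) -> None:
        # Processing the OPEN reply is what updates the client's seqid.
        if self.pending_open_seqid is not None:
            self.seqid = max(self.seqid, self.pending_open_seqid)
            self.pending_open_seqid = None


def close_with_retry(server: Server, client: Client,
                     max_retries: int = 5) -> Tuple[int, int]:
    """Retry CLOSE on OLD_STATEID; return (status, attempts).

    max_retries exists only so the simulation terminates.
    """
    for attempt in range(1, max_retries + 1):
        status = server.close(client.seqid)
        if status != NFS4ERR_OLD_STATEID:
            return status, attempt
        # Expectation: a racing OPEN reply arrives and updates the stateid.
        client.apply_pending_open_reply()
    return NFS4ERR_OLD_STATEID, max_retries


def simulate(open_task_killed: bool) -> Tuple[int, int]:
    server, client = Server(), Client()
    new_seqid = server.open()  # OPEN from another thread reaches the server
    if not open_task_killed:
        # Normal case: the reply will be processed by the client.
        client.pending_open_seqid = new_seqid
    # Suspected case: the OPEN task exits on a signal, so the reply
    # is never applied and the client's seqid stays behind.
    return close_with_retry(server, client)
```

In this model simulate(open_task_killed=False) succeeds on the second
CLOSE, while simulate(open_task_killed=True) gets NFS4ERR_OLD_STATEID on
every attempt, which would match the hang in rpc_wait_bit_killable if
the real client behaves the same way.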