Re: [bug report] task hang while testing xfstests generic/323

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2019/3/1 上午7:56, Trond Myklebust wrote:
> On Thu, 2019-02-28 at 17:26 -0500, Olga Kornievskaia wrote:
>> On Thu, Feb 28, 2019 at 5:11 AM Jiufei Xue <
>> jiufei.xue@xxxxxxxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> when I tested xfstests/generic/323 with NFSv4.1 and v4.2, the task
>>> changed to zombie occasionally while a thread is hanging with the
>>> following stack:
>>>
>>> [<0>] rpc_wait_bit_killable+0x1e/0xa0 [sunrpc]
>>> [<0>] nfs4_do_close+0x21b/0x2c0 [nfsv4]
>>> [<0>] __put_nfs_open_context+0xa2/0x110 [nfs]
>>> [<0>] nfs_file_release+0x35/0x50 [nfs]
>>> [<0>] __fput+0xa2/0x1c0
>>> [<0>] task_work_run+0x82/0xa0
>>> [<0>] do_exit+0x2ac/0xc20
>>> [<0>] do_group_exit+0x39/0xa0
>>> [<0>] get_signal+0x1ce/0x5d0
>>> [<0>] do_signal+0x36/0x620
>>> [<0>] exit_to_usermode_loop+0x5e/0xc2
>>> [<0>] do_syscall_64+0x16c/0x190
>>> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [<0>] 0xffffffffffffffff
>>>
>>> Since commit 12f275cdd163(NFSv4: Retry CLOSE and DELEGRETURN on
>>> NFS4ERR_OLD_STATEID), the client will retry to close the file when
>>> stateid generation number in client is lower than server.
>>>
>>> The original intention of this commit is retrying the operation
>>> while
>>> racing with an OPEN. However, in this case the stateid generation
>>> remains
>>> mismatch forever.
>>>
>>> Any suggestions?
>>
>> Can you include a network trace of the failure? Is it possible that
>> the server has crashed on reply to the close and that's why the task
>> is hung? What server are you testing against?
>>
>> I have seen trace where close would get ERR_OLD_STATEID and would
>> still retry with the same open state until it got a reply to the OPEN
>> which changed the state and when the client received reply to that,
>> it'll retry the CLOSE with the updated stateid.
> 
> I agree with Olga's assessment. The server is not allowed to randomly
> change the values of the seqid, and the client should be taking pains
> to replay any OPEN calls for which a reply is missed. The expectation
> is therefore that NFS4ERR_OLD_STATEID should always be a temporary
> state.
> 
The server bumped the seqid because of a new OPEN from another thread.
And I doubt that maybe the new OPEN task exit while receiving a signal
without update the stateid.

> If it is not, then the bugreport needs to explain why the server bumped
> the seqid without informing the client.
> 



[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux