Enrico Scholz <enrico.scholz@xxxxxxxxxxxxxxxxx> writes: > I am experiencing random NFS client lockups after 1-2 days. > ... > The full log is available at https://pastebin.pl/view/7d0b345b Filtering only RPC messages[1] shows ----- Mar 16 04:48:58.345357: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049) Mar 16 04:48:58.345487: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2 Mar 16 04:48:58.345705: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0 Mar 16 05:02:40.037783: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1 Mar 16 05:02:40.037921: RPC: xs_close xprt 0000000022aecad1 Mar 16 05:02:40.038076: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.038211: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.038293: RPC: xs_error_report client 0000000022aecad1, error=104... Mar 16 05:02:40.038449: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.039134: RPC: xs_connect scheduled xprt 0000000022aecad1 Mar 16 05:02:40.042991: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:883: ok (0) Mar 16 05:02:40.043072: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049) Mar 16 05:02:40.043159: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2 Mar 16 05:02:40.043308: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0 Mar 16 05:02:40.045883: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1 Mar 16 05:02:40.046163: RPC: xs_close xprt 0000000022aecad1 Mar 16 05:02:40.046410: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.046625: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.047428: RPC: xs_connect scheduled xprt 0000000022aecad1 Mar 16 05:02:40.050387: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:822: ok (0) Mar 16 05:02:40.050614: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049) Mar 16 05:02:40.050662: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2 Mar 16 05:02:40.050788: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0 Mar 16 05:02:40.051969: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1 Mar 16 05:02:40.052067: RPC: xs_close xprt 0000000022aecad1 Mar 16 05:02:40.052189: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.052243: RPC: xs_error_report client 0000000022aecad1, error=32... Mar 16 05:02:40.052367: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.052503: RPC: state 7 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:02:40.053201: RPC: xs_connect scheduled xprt 0000000022aecad1 Mar 16 05:02:40.055886: RPC: xs_bind 0000:0000:0000:0000:0000:0000:0000:0000:875: ok (0) __A__ 05:02:40.055947: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049) Mar 16 05:02:40.055995: RPC: 0000000022aecad1 connect status 115 connected 0 sock state 2 Mar 16 05:02:40.056125: RPC: state 1 conn 0 dead 0 zapped 1 sk_shutdown 0 Mar 16 05:07:28.326605: RPC: state 8 conn 1 dead 0 zapped 1 sk_shutdown 1 Mar 16 05:07:28.326679: RPC: xs_connect scheduled xprt 0000000022aecad1 __B__ 05:07:28.326978: RPC: worker connecting xprt 0000000022aecad1 via tcp to XXXXX:2001:1022:: (port 2049) Mar 16 05:07:28.327050: RPC: 0000000022aecad1 connect status 0 connected 0 sock state 8 __C__ 05:07:28.327113: RPC: xs_close xprt 0000000022aecad1 Mar 16 05:07:28.327229: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3 Mar 16 05:07:28.327446: RPC: state 9 conn 0 dead 0 zapped 1 sk_shutdown 3 ----- For me, it seems that 'xs_close' at position __C__ belongs to the connection at __A__ but is applied to the newly created connection __B__. Because the close did not happened yet, __B__ operated on the old socket which is in state 8 (CLOSE_WAIT) and connect returned with 0 instead of -EINPROGRESS hence. There will not be generated a new connection in this case. It seems to be some race where 'xs_connect()' is executed before a delayed 'xs_error_handle()'. Both functions are called in reaction of a 'CLOSE_WAIT'. Enrico