> On Nov 30, 2023, at 12:22 AM, Sukruth Sridharan (he/him) <susridharan@xxxxxxxxxxxxxxx> wrote:
>
> I notice the following hung task panic on the 6.2.0-34 kernel during an RDMA disconnect:
>
> [Wed Nov 1 08:03:54 2023] INFO: task kworker/u16:5:2274646 blocked for more than 120 seconds.
> [Wed Nov 1 08:03:55 2023]       Tainted: G        W  OE      6.2.0-34-generic #34-Ubuntu
> [Wed Nov 1 08:03:55 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Wed Nov 1 08:03:55 2023] task:kworker/u16:5   state:D stack:0     pid:2274646 ppid:2      flags:0x00004000
> [Wed Nov 1 08:03:55 2023] Workqueue: xprtiod xprt_autoclose [sunrpc]
> [Wed Nov 1 08:03:55 2023] Call Trace:
> [Wed Nov 1 08:03:55 2023]  <TASK>
> [Wed Nov 1 08:03:55 2023]  __schedule+0x2aa/0x610
> [Wed Nov 1 08:03:55 2023]  schedule+0x63/0x110
> [Wed Nov 1 08:03:55 2023]  schedule_timeout+0x157/0x170
> [Wed Nov 1 08:03:55 2023]  wait_for_completion+0x88/0x150
> [Wed Nov 1 08:03:55 2023]  rpcrdma_xprt_disconnect+0x33f/0x350 [rpcrdma]
> [Wed Nov 1 08:03:55 2023]  xprt_rdma_close+0x12/0x40 [rpcrdma]
> [Wed Nov 1 08:03:55 2023]  xprt_autoclose+0x5c/0x120 [sunrpc]
> [Wed Nov 1 08:03:55 2023]  process_one_work+0x225/0x430
> [Wed Nov 1 08:03:55 2023]  worker_thread+0x50/0x3e0
> [Wed Nov 1 08:03:55 2023]  ? __pfx_worker_thread+0x10/0x10
> [Wed Nov 1 08:03:55 2023]  kthread+0xe9/0x110
> [Wed Nov 1 08:03:55 2023]  ? __pfx_kthread+0x10/0x10
> [Wed Nov 1 08:03:55 2023]  ret_from_fork+0x2c/0x50
> [Wed Nov 1 08:03:55 2023]  </TASK>

Hi Sukruth -

This isn't a panic, fortunately. It's simply a report that the task is
not making progress. More below...

> The flow which induced the bug is as follows:
>
> 1. Client initiates a connection
> 2. Server hands off the response to the first RPC on the connection to
>    the NIC (Mellanox ConnectX-5)
> 3. NIC tries to send the response around 6 times and then fails it with
>    RNR (Receiver Not Ready)
> 4. Client issues a disconnect (possibly because it didn't receive a response)
> 5. Server cleans up the connection state
> 6. Client runs into the above panic as part of the disconnect while
>    draining the I/Os
>
> It looks like re_receiving is set only in rpcrdma_post_recvs(), and the
> reason it wouldn't be reset is that the memory-region allocation code
> fails. That is possible if a disconnect on the client somehow blocks
> allocation.
>
> void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp)
> {
>         // ... (some initialization code)
>
>         if (atomic_inc_return(&ep->re_receiving) > 1)
>                 goto out;
>
>         // ... (some allocation code)
>
>         if (!wr)        // <<< PROBLEM HERE: bails out without decrementing re_receiving >>>
>                 goto out;
>
>         // ... (post recv code, and some error handling)
>
>         if (atomic_dec_return(&ep->re_receiving) > 0)
>                 complete(&ep->re_done);
>
> out:
>         trace_xprtrdma_post_recvs(r_xprt, count);
>         ep->re_receive_count += count;
>         return;
> }
>
> static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
> {
>         struct rpcrdma_ep *ep = r_xprt->rx_ep;
>         struct rdma_cm_id *id = ep->re_id;
>
>         /* Wait for rpcrdma_post_recvs() to leave its critical
>          * section.
>          */
>         if (atomic_inc_return(&ep->re_receiving) > 1)   // <<< re_receiving is never reset, so this wait gets stuck >>>
>                 wait_for_completion(&ep->re_done);
>
>         /* Flush Receives, then wait for deferred Reply work
>          * to complete.
>          */
>         ib_drain_rq(id->qp);
>
>         /* Deferred Reply processing might have scheduled
>          * local invalidations.
>          */
>         ib_drain_sq(id->qp);
>
>         rpcrdma_ep_put(ep);
> }
>
> Can you help conclude whether the above theory, that the bug is in the
> client code, is right?
> If not, can you help with the steps/data points required to debug this
> further?

Try applying:

  895cedc17919 ("xprtrdma: Remap Receive buffers after a reconnect")


--
Chuck Lever
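As an aside, the hang theory above can be made concrete with a minimal
userspace model of the re_receiving/re_done handshake. This is only a
sketch under the poster's assumptions, not the kernel code and not the
fix Chuck points to; the names post_recvs(), drain(), alloc_fails, and
balance_on_error are invented for the illustration. With
balance_on_error set to false, the model reproduces the stuck
wait_for_completion() seen in the trace; with it set to true, the
early-exit path undoes its increment and the drain side completes.

/* model.c - build with: cc -pthread model.c -o model */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_int re_receiving;                 /* models ep->re_receiving */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t done_cv = PTHREAD_COND_INITIALIZER;
static bool done;                               /* models ep->re_done */

static void complete_done(void)                 /* models complete() */
{
        pthread_mutex_lock(&lock);
        done = true;
        pthread_cond_signal(&done_cv);
        pthread_mutex_unlock(&lock);
}

static void wait_for_done(void)                 /* models wait_for_completion() */
{
        pthread_mutex_lock(&lock);
        while (!done)
                pthread_cond_wait(&done_cv, &lock);
        pthread_mutex_unlock(&lock);
}

/* Models rpcrdma_post_recvs(); alloc_fails stands in for "wr == NULL". */
static void post_recvs(bool alloc_fails, bool balance_on_error)
{
        if (atomic_fetch_add(&re_receiving, 1) + 1 > 1) /* atomic_inc_return() > 1 */
                goto out;

        if (alloc_fails) {
                if (!balance_on_error)
                        goto out;               /* buggy path: counter stays elevated */
                /* balanced path: undo the increment before bailing out */
                if (atomic_fetch_sub(&re_receiving, 1) - 1 > 0)
                        complete_done();
                goto out;
        }

        /* ... the real code posts Receive WRs here ... */

        if (atomic_fetch_sub(&re_receiving, 1) - 1 > 0) /* atomic_dec_return() > 0 */
                complete_done();
out:
        return;
}

/* Models the wait at the top of rpcrdma_xprt_drain(). */
static void *drain(void *arg)
{
        (void)arg;
        if (atomic_fetch_add(&re_receiving, 1) + 1 > 1) {
                printf("drain: waiting for post_recvs() to finish\n");
                wait_for_done();                /* never returns on the buggy path */
        }
        printf("drain: completed\n");
        return NULL;
}

int main(void)
{
        pthread_t t;

        /* Flip balance_on_error to false to reproduce the hung drain. */
        post_recvs(/* alloc_fails */ true, /* balance_on_error */ true);

        pthread_create(&t, NULL, drain, NULL);
        pthread_join(t, NULL);
        return 0;
}

The model only illustrates the failure mode; what actually happens on a
real system depends on the kernel paths quoted above, and the commit
Chuck references is the place to look for the upstream change.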