On Thu, Nov 30, 2023 at 1:06 PM Bagas Sanjaya <bagasdotme@xxxxxxxxx> wrote: > > On Thu, Nov 30, 2023 at 10:52:59AM +0530, Sukruth Sridharan (he/him) wrote: > > I notice the following hung task panic on 6.2.0-34 kernel during RDMA disconnect > > > > [Wed Nov 1 08:03:54 2023] INFO: task kworker/u16:5:2274646 blocked > > for more than 120 seconds. > > [Wed Nov 1 08:03:55 2023] Tainted: G W OE > > 6.2.0-34-generic #34-Ubuntu > > [Wed Nov 1 08:03:55 2023] "echo 0 > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [Wed Nov 1 08:03:55 2023] task:kworker/u16:5 state:D stack:0 > > pid:2274646 ppid:2 flags:0x00004000 > > [Wed Nov 1 08:03:55 2023] Workqueue: xprtiod xprt_autoclose [sunrpc] > > [Wed Nov 1 08:03:55 2023] Call Trace: > > [Wed Nov 1 08:03:55 2023] <TASK> > > [Wed Nov 1 08:03:55 2023] __schedule+0x2aa/0x610 > > [Wed Nov 1 08:03:55 2023] schedule+0x63/0x110 > > [Wed Nov 1 08:03:55 2023] schedule_timeout+0x157/0x170 > > [Wed Nov 1 08:03:55 2023] wait_for_completion+0x88/0x150 > > [Wed Nov 1 08:03:55 2023] rpcrdma_xprt_disconnect+0x33f/0x350 [rpcrdma] > > [Wed Nov 1 08:03:55 2023] xprt_rdma_close+0x12/0x40 [rpcrdma] > > [Wed Nov 1 08:03:55 2023] xprt_autoclose+0x5c/0x120 [sunrpc] > > [Wed Nov 1 08:03:55 2023] process_one_work+0x225/0x430 > > [Wed Nov 1 08:03:55 2023] worker_thread+0x50/0x3e0 > > [Wed Nov 1 08:03:55 2023] ? __pfx_worker_thread+0x10/0x10 > > [Wed Nov 1 08:03:55 2023] kthread+0xe9/0x110 > > [Wed Nov 1 08:03:55 2023] ? __pfx_kthread+0x10/0x10 > > [Wed Nov 1 08:03:55 2023] ret_from_fork+0x2c/0x50 > > [Wed Nov 1 08:03:55 2023] </TASK> > > > > The flow which induced the bug is as follows: > > 1. Client initiates connection > > 2. Server hands off the response to the first RPC on the connection to > > the NIC (Mellanox ConnectX-5) > > 3. NIC tries to send the response around 6 times and fails the response with RNR > > 4. Client issues disconnect (possibly because it didn't receive a response) > > 5. Server cleans up the connection state > > 6. Client runs into the above panic as part of disconnect while draining the IOs > > > > It looks like re_receiving is set only in rpcrdma_post_recvs, and the > > reason why it wouldn't be reset is if memory-region allocation code > > fails. > > That is possible if disconnect on the client somehow blocks allocation. > > > > void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp) > > { > > // ... (some initialization code) > > > > if (atomic_inc_return(&ep->re_receiving) > 1) > > goto out; > > > > // ... (some allocation code) > > > > if (!wr) // <<<<<<<<<<<<<<<<<< PROBLEM HERE >>>>>>>>>>>>>>>>>>> > > goto out; > > > > // ... (post recv code, and some error handling) > > > > if (atomic_dec_return(&ep->re_receiving) > 0) > > complete(&ep->re_done); > > > > out: > > trace_xprtrdma_post_recvs(r_xprt, count); > > ep->re_receive_count += count; > > return; > > } > > > > static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt) > > { > > struct rpcrdma_ep *ep = r_xprt->rx_ep; > > struct rdma_cm_id *id = ep->re_id; > > > > /* Wait for rpcrdma_post_recvs() to leave its critical > > * section. > > */ > > if (atomic_inc_return(&ep->re_receiving) > 1) // > > <<<<<<<<<<<<<<<<<<< This is not reset, so wait gets stuck > > >>>>>>>>>>>>>>>>> > > wait_for_completion(&ep->re_done); > > > > /* Flush Receives, then wait for deferred Reply work > > * to complete. > > */ > > ib_drain_rq(id->qp); > > > > /* Deferred Reply processing might have scheduled > > * local invalidations. > > */ > > ib_drain_sq(id->qp); > > > > rpcrdma_ep_put(ep); > > } > > > > Can you help conclude if the above theory around the bug being in the > > client code is right? If not, can you help with steps/data points > > required to debug this further? > > > > Can you verify that the bug still occurs with latest vanilla kernel > (currently v6.7-rc3)? > > Thanks. > > -- > An old man doll... just what I always wanted! - Clara The issue has been seen once in the past few weeks. Unfortunately, we're yet to see a repro of the same. We will try to repro it on the latest kernel. Curious if there's any improvements gone in that you suspect would have resolved the issue? (Apologies for the top post earlier) Thanks, Sukruth