I notice the following hung-task report on the 6.2.0-34 kernel during an RDMA disconnect:

[Wed Nov 1 08:03:54 2023] INFO: task kworker/u16:5:2274646 blocked for more than 120 seconds.
[Wed Nov 1 08:03:55 2023] Tainted: G W OE 6.2.0-34-generic #34-Ubuntu
[Wed Nov 1 08:03:55 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Wed Nov 1 08:03:55 2023] task:kworker/u16:5 state:D stack:0 pid:2274646 ppid:2 flags:0x00004000
[Wed Nov 1 08:03:55 2023] Workqueue: xprtiod xprt_autoclose [sunrpc]
[Wed Nov 1 08:03:55 2023] Call Trace:
[Wed Nov 1 08:03:55 2023]  <TASK>
[Wed Nov 1 08:03:55 2023]  __schedule+0x2aa/0x610
[Wed Nov 1 08:03:55 2023]  schedule+0x63/0x110
[Wed Nov 1 08:03:55 2023]  schedule_timeout+0x157/0x170
[Wed Nov 1 08:03:55 2023]  wait_for_completion+0x88/0x150
[Wed Nov 1 08:03:55 2023]  rpcrdma_xprt_disconnect+0x33f/0x350 [rpcrdma]
[Wed Nov 1 08:03:55 2023]  xprt_rdma_close+0x12/0x40 [rpcrdma]
[Wed Nov 1 08:03:55 2023]  xprt_autoclose+0x5c/0x120 [sunrpc]
[Wed Nov 1 08:03:55 2023]  process_one_work+0x225/0x430
[Wed Nov 1 08:03:55 2023]  worker_thread+0x50/0x3e0
[Wed Nov 1 08:03:55 2023]  ? __pfx_worker_thread+0x10/0x10
[Wed Nov 1 08:03:55 2023]  kthread+0xe9/0x110
[Wed Nov 1 08:03:55 2023]  ? __pfx_kthread+0x10/0x10
[Wed Nov 1 08:03:55 2023]  ret_from_fork+0x2c/0x50
[Wed Nov 1 08:03:55 2023]  </TASK>

The flow that induced the bug is as follows:

1. Client initiates a connection.
2. Server hands off the response to the first RPC on the connection to the NIC (Mellanox ConnectX-5).
3. The NIC retries the response about 6 times and then fails it with an RNR error.
4. Client issues a disconnect (possibly because it did not receive a response).
5. Server cleans up the connection state.
6. Client runs into the hang above while draining I/O as part of the disconnect.

Looking at the code, re_receiving is incremented at the top of rpcrdma_post_recvs(), but the matching decrement is skipped when the allocation code fails and wr stays NULL, because the function jumps straight to the out label. That seems possible here if the disconnect on the client somehow interferes with the allocation.

void rpcrdma_post_recvs(struct rpcrdma_xprt *r_xprt, int needed, bool temp)
{
	// ... (some initialization code)

	if (atomic_inc_return(&ep->re_receiving) > 1)
		goto out;

	// ... (some allocation code)

	if (!wr)	// <<< PROBLEM HERE: jumps to out without decrementing re_receiving >>>
		goto out;

	// ... (post recv code, and some error handling)

	if (atomic_dec_return(&ep->re_receiving) > 0)
		complete(&ep->re_done);

out:
	trace_xprtrdma_post_recvs(r_xprt, count);
	ep->re_receive_count += count;
	return;
}

static void rpcrdma_xprt_drain(struct rpcrdma_xprt *r_xprt)
{
	struct rpcrdma_ep *ep = r_xprt->rx_ep;
	struct rdma_cm_id *id = ep->re_id;

	/* Wait for rpcrdma_post_recvs() to leave its critical
	 * section.
	 */
	if (atomic_inc_return(&ep->re_receiving) > 1)	// <<< re_receiving was never decremented, so this waits forever >>>
		wait_for_completion(&ep->re_done);

	/* Flush Receives, then wait for deferred Reply work
	 * to complete.
	 */
	ib_drain_rq(id->qp);

	/* Deferred Reply processing might have scheduled
	 * local invalidations.
	 */
	ib_drain_sq(id->qp);

	rpcrdma_ep_put(ep);
}

Can you help confirm whether the above theory, that the bug is in the client code, is right? If not, could you suggest the steps or data points needed to debug this further?

Thanks in advance.
Sukruth
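
P.S. To make the theory concrete: if the analysis above is right, I would expect a change along these lines on the early-exit path to avoid the hang. This is only an untested sketch based on my reading of rpcrdma_post_recvs(), not a proposed patch:

	if (!wr) {
		/* Nothing was allocated, so nothing will be posted.
		 * Drop the count taken above and wake any waiter in
		 * rpcrdma_xprt_drain(), otherwise it blocks forever
		 * on re_done.
		 */
		if (atomic_dec_return(&ep->re_receiving) > 0)
			complete(&ep->re_done);
		goto out;
	}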