On Mon Feb 10, 2025 at 8:41 AM CET, zhang.guanghui@xxxxxxxx wrote:
> Hello
>
> When using the nvme-tcp driver in a storage cluster, the driver may trigger a null pointer causing the host to crash several times.
> By analyzing the vmcore, we know the direct cause is that the request->mq_hctx was used after free.
>
> CPU1                          CPU2
>
> nvme_tcp_poll                 nvme_tcp_try_send  --failed to send reqrest 13
> nvme_tcp_try_recv             nvme_tcp_fail_request
> nvme_tcp_recv_skb             nvme_tcp_end_request
> nvme_tcp_recv_pdu             nvme_complete_rq
> nvme_tcp_handle_comp          nvme_retry_req -- request->mq_hctx have been freed, is NULL.
> nvme_tcp_process_nvme_cqe
> nvme_complete_rq
> nvme_end_req
> blk_mq_end_request

Taking a step back: let's take a different approach and try to avoid the
double completion.

The problem here is that we apparently received an nvme_tcp_rsp capsule
from the target, meaning that the command has been processed (I guess the
capsule has an error status?).

So maybe only part of the command has been sent?
Why do we receive the rsp capsule at all?
Shouldn't this be treated as a fatal error by the controller?

Maurizio