On Tue Feb 11, 2025 at 9:04 AM CET, zhang.guanghui@xxxxxxxx wrote:
> Hi
>
> This is a race issue, I can't reproduce it stably yet. I have not tested
> the latest kernel, but in fact I've synced some nvme-tcp patches from the
> latest upstream,

Hello,

could you try this patch? queue_lock should protect against concurrent
"error recovery", while send_mutex should serialize try_recv() and
try_send(), emulating the way io_work works. Concurrent calls to
try_recv() should already be protected by sock_lock.

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 841238f38fdd..f464de04ff4d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2653,16 +2653,24 @@ static int nvme_tcp_poll(struct blk_mq_hw_ctx *hctx, struct io_comp_batch *iob)
 {
 	struct nvme_tcp_queue *queue = hctx->driver_data;
 	struct sock *sk = queue->sock->sk;
+	int r = 0;
 
+	mutex_lock(&queue->queue_lock);
 	if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
-		return 0;
+		goto out;
 
 	set_bit(NVME_TCP_Q_POLLING, &queue->flags);
 	if (sk_can_busy_loop(sk) && skb_queue_empty_lockless(&sk->sk_receive_queue))
 		sk_busy_loop(sk, true);
+
+	mutex_lock(&queue->send_mutex);
 	nvme_tcp_try_recv(queue);
+	r = queue->nr_cqe;
+	mutex_unlock(&queue->send_mutex);
 	clear_bit(NVME_TCP_Q_POLLING, &queue->flags);
-	return queue->nr_cqe;
+out:
+	mutex_unlock(&queue->queue_lock);
+	return r;
 }
 
 static int nvme_tcp_get_address(struct nvme_ctrl *ctrl, char *buf, int size)

Thanks,
Maurizio