Hey Sagi, it hits the empty rsp list path often with your debug patch. I added code to BUG_ON() after the 10th hit, and I have a crash dump I'm looking at. Isn't the rsp list supposed to be sized such that it will never be empty when a new rsp is needed? I wonder if there is a leak.
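Roughly the kind of hack I mean (paraphrasing it here; the counter name and message are made up):

        /* inside nvmet_rdma_get_rsp(), under queue->rsps_lock */
        if (unlikely(list_empty(&queue->free_rsps))) {
                static atomic_t empty_hits = ATOMIC_INIT(0);

                pr_warn("nvmet_rdma: free_rsps empty on queue %d\n",
                        queue->idx);
                /* crash on the 10th occurrence so I get a dump to inspect */
                BUG_ON(atomic_inc_return(&empty_hits) > 10);
        }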
Doesn't look like it from my scan...
I do see that during this heavy load, the rdma send queue "full" condition gets hit often:

static bool nvmet_rdma_execute_command(struct nvmet_rdma_rsp *rsp)
{
        struct nvmet_rdma_queue *queue = rsp->queue;

        if (unlikely(atomic_sub_return(1 + rsp->n_rdma,
                        &queue->sq_wr_avail) < 0)) {
                pr_debug("IB send queue full (needed %d): queue %u cntlid %u\n",
                         1 + rsp->n_rdma, queue->idx,
                         queue->nvme_sq.ctrl->cntlid);
                atomic_add(1 + rsp->n_rdma, &queue->sq_wr_avail);
                return false;
        }
        ...

So commands are getting added to the wr_wait list:

static void nvmet_rdma_handle_command(struct nvmet_rdma_queue *queue,
                struct nvmet_rdma_rsp *cmd)
{
        ...
        if (unlikely(!nvmet_rdma_execute_command(cmd))) {
                spin_lock(&queue->rsp_wr_wait_lock);
                list_add_tail(&cmd->wait_list, &queue->rsp_wr_wait_list);
                spin_unlock(&queue->rsp_wr_wait_lock);
        }
        ...

Perhaps there's some bug in the wr_wait_list processing of deferred commands? I don't see anything, though.
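For reference, the deferred-command processing I was looking at is roughly this (quoting from memory and trimmed, so double-check against your tree):

static void nvmet_rdma_process_wr_wait_list(struct nvmet_rdma_queue *queue)
{
        spin_lock(&queue->rsp_wr_wait_lock);
        while (!list_empty(&queue->rsp_wr_wait_list)) {
                struct nvmet_rdma_rsp *rsp;
                bool ret;

                rsp = list_entry(queue->rsp_wr_wait_list.next,
                                struct nvmet_rdma_rsp, wait_list);
                list_del(&rsp->wait_list);

                spin_unlock(&queue->rsp_wr_wait_lock);
                ret = nvmet_rdma_execute_command(rsp);
                spin_lock(&queue->rsp_wr_wait_lock);

                /* still no send queue credits: put it back and stop */
                if (!ret) {
                        list_add(&rsp->wait_list, &queue->rsp_wr_wait_list);
                        break;
                }
        }
        spin_unlock(&queue->rsp_wr_wait_lock);
}

As far as I can tell it only runs from nvmet_rdma_release_rsp() after the send queue credits are returned, and it re-queues the command at the head and stops if execution still fails, which looks fine to me.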
I assume this could happen if, under heavy load, the device send completions are slower than the rate at which new commands arrive (due to the device and/or software). Because we post the recv before sending the response back, there is a window where the host can send us a new command before the send completion has arrived, which is why we allocate extra rsps. However, I think nothing prevents that gap from growing under heavy load until we exhaust the 2x rsps. So perhaps this is something we actually need to account for...
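One way to account for it might be to fall back to a dynamic allocation when the pre-allocated rsps run out, along these lines (a completely untested sketch just to illustrate the idea; the "allocated" flag doesn't exist today, and the release path would have to kfree() such rsps instead of returning them to free_rsps):

static struct nvmet_rdma_rsp *
nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
{
        struct nvmet_rdma_rsp *rsp;
        unsigned long flags;

        spin_lock_irqsave(&queue->rsps_lock, flags);
        rsp = list_first_entry_or_null(&queue->free_rsps,
                        struct nvmet_rdma_rsp, free_list);
        if (likely(rsp))
                list_del(&rsp->free_list);
        spin_unlock_irqrestore(&queue->rsps_lock, flags);

        if (unlikely(!rsp)) {
                /*
                 * Pre-allocated rsps exhausted: allocate one on the fly.
                 * GFP_KERNEL assumes we can sleep here, and the rsp would
                 * also need the same response buffer setup/DMA mapping the
                 * pre-allocated ones get -- both assumptions in this sketch.
                 */
                rsp = kzalloc(sizeof(*rsp), GFP_KERNEL);
                if (rsp)
                        rsp->allocated = true;  /* hypothetical flag */
        }

        return rsp;
}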