On 8/20/2018 3:47 PM, Sagi Grimberg wrote:
>
>> Resending in plain text...
>>
>> ----
>>
>> Hey guys,
>>
>> I'm debugging an nvmet_rdma crash on the linux-4.14.52 stable kernel
>> code. Under heavy load, including 80 nvmf devices, after 13 hours of
>> running, I see an Oops [1] when the target is processing a new ingress
>> nvme command. It crashes in nvmet_req_init() because req->rsp is NULL:
>>
>> 493 bool nvmet_req_init(struct nvmet_req *req, struct nvmet_cq *cq,
>> 494                 struct nvmet_sq *sq, struct nvmet_fabrics_ops *ops)
>> 495 {
>> 496         u8 flags = req->cmd->common.flags;
>> 497         u16 status;
>> 498
>> 499         req->cq = cq;
>> 500         req->sq = sq;
>> 501         req->ops = ops;
>> 502         req->sg = NULL;
>> 503         req->sg_cnt = 0;
>> 504         req->rsp->status = 0;    <-- HERE
>>
>> The nvme command opcode is nvme_cmd_write. The nvmet_rdma_queue state
>> is NVMET_RDMA_Q_LIVE. The nvmet_req looks valid [2], i.e. not garbage.
>> But it seems very bad that req->rsp is NULL! :)
>>
>> Any thoughts? I didn't see anything like this in recent nvmf fixes...
>
> Is it possible that you ran out of rsps and got a corrupted rsp?

That is what it looks like: the nvmet_rdma_rsp pointer returned from
nvmet_rdma_get_rsp() was not within the bounds of the response struct
array allocated for that queue. I wasn't sure how that could happen. I
would have thought list_first_entry() would return NULL if the list was
empty, but I guess not.

> How about trying out this patch to add more information:

I'll try this out. Thanks Sagi!
> --
> diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
> index e7f43d1e1779..890d9c45ca33 100644
> --- a/drivers/nvme/target/rdma.c
> +++ b/drivers/nvme/target/rdma.c
> @@ -66,6 +66,7 @@ struct nvmet_rdma_rsp {
>
>         struct nvmet_req        req;
>
> +       bool                    allocated;
>         u8                      n_rdma;
>         u32                     flags;
>         u32                     invalidate_rkey;
> @@ -174,11 +175,20 @@ nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
>         unsigned long flags;
>
>         spin_lock_irqsave(&queue->rsps_lock, flags);
> -       rsp = list_first_entry(&queue->free_rsps,
> +       rsp = list_first_entry_or_null(&queue->free_rsps,
>                                 struct nvmet_rdma_rsp, free_list);
> -       list_del(&rsp->free_list);
> +       if (rsp) {
> +               list_del(&rsp->free_list);
> +               rsp->allocated = false;
> +       }
>         spin_unlock_irqrestore(&queue->rsps_lock, flags);
>
> +       if (!rsp) {
> +               pr_debug("dynamically allocated rsp\n");
> +               rsp = kmalloc(sizeof(*rsp), GFP_KERNEL);
> +               rsp->allocated = true;
> +       }
> +
>         return rsp;
> }
>
> @@ -187,6 +197,11 @@ nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
> {
>         unsigned long flags;
>
> +       if (rsp->allocated) {
> +               kfree(rsp);
> +               return;
> +       }
> +
>         spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
>         list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
>         spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
> --