On 6/14/19 3:10 PM, Himanshu Madhani wrote:
From: Quinn Tran <qutran@xxxxxxxxxxx> This patch uses kref to protect access between fcp_abort path and nvme command and LS command completion path. Stack trace below shows the abort path is accessing stale memory (nvme_private->sp). When command kref reaches 0, nvme_private & srb resource will be disconnected from each other. Any subsequence nvme abort request will not be able to reference the original srb. [ 5631.003998] BUG: unable to handle kernel paging request at 00000010000005d8 [ 5631.004016] IP: [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] [ 5631.004086] Workqueue: events qla_nvme_abort_work [qla2xxx] [ 5631.004097] RIP: 0010:[<ffffffffc087df92>] [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] [ 5631.004109] Call Trace: [ 5631.004115] [<ffffffffaa4b8174>] ? pwq_dec_nr_in_flight+0x64/0xb0 [ 5631.004117] [<ffffffffaa4b9d4f>] process_one_work+0x17f/0x440 [ 5631.004120] [<ffffffffaa4bade6>] worker_thread+0x126/0x3c0 Signed-off-by: Quinn Tran <qutran@xxxxxxxxxxx> Signed-off-by: Himanshu Madhani <hmadhani@xxxxxxxxxxx> --- drivers/scsi/qla2xxx/qla_def.h | 2 + drivers/scsi/qla2xxx/qla_nvme.c | 164 ++++++++++++++++++++++++++++------------ drivers/scsi/qla2xxx/qla_nvme.h | 1 + 3 files changed, 117 insertions(+), 50 deletions(-) diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h index 602ed24bb806..85a27ee5d647 100644 --- a/drivers/scsi/qla2xxx/qla_def.h +++ b/drivers/scsi/qla2xxx/qla_def.h @@ -532,6 +532,8 @@ typedef struct srb { uint8_t cmd_type; uint8_t pad[3]; atomic_t ref_count; + struct kref cmd_kref; /* need to migrate ref_count over to this */ + void *priv; wait_queue_head_t nvme_ls_waitq; struct fc_port *fcport; struct scsi_qla_host *vha;
My feedback about this patch is as follows: - I think that having two atomic counters in struct srb with a very similar purpose is overkill and also that a single atomic counter should be sufficient in this structure. - The problem fixed by this patch not only can happen with NVMe-FC but also with FC. I would prefer to see all code paths fixed for which a race between completion and abort can occur. Thanks, Bart.