Hi Bart, On 6/14/19, 3:24 PM, "Bart Van Assche" <bvanassche@xxxxxxx> wrote: External Email ---------------------------------------------------------------------- On 6/14/19 3:10 PM, Himanshu Madhani wrote: > From: Quinn Tran <qutran@xxxxxxxxxxx> > > This patch uses kref to protect access between fcp_abort > path and nvme command and LS command completion path. > Stack trace below shows the abort path is accessing stale > memory (nvme_private->sp). > > When command kref reaches 0, nvme_private & srb resource will > be disconnected from each other. Any subsequence nvme abort > request will not be able to reference the original srb. > > [ 5631.003998] BUG: unable to handle kernel paging request at 00000010000005d8 > [ 5631.004016] IP: [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004086] Workqueue: events qla_nvme_abort_work [qla2xxx] > [ 5631.004097] RIP: 0010:[<ffffffffc087df92>] [<ffffffffc087df92>] qla_nvme_abort_work+0x22/0x100 [qla2xxx] > [ 5631.004109] Call Trace: > [ 5631.004115] [<ffffffffaa4b8174>] ? pwq_dec_nr_in_flight+0x64/0xb0 > [ 5631.004117] [<ffffffffaa4b9d4f>] process_one_work+0x17f/0x440 > [ 5631.004120] [<ffffffffaa4bade6>] worker_thread+0x126/0x3c0 > > Signed-off-by: Quinn Tran <qutran@xxxxxxxxxxx> > Signed-off-by: Himanshu Madhani <hmadhani@xxxxxxxxxxx> > --- > drivers/scsi/qla2xxx/qla_def.h | 2 + > drivers/scsi/qla2xxx/qla_nvme.c | 164 ++++++++++++++++++++++++++++------------ > drivers/scsi/qla2xxx/qla_nvme.h | 1 + > 3 files changed, 117 insertions(+), 50 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_def.h b/drivers/scsi/qla2xxx/qla_def.h > index 602ed24bb806..85a27ee5d647 100644 > --- a/drivers/scsi/qla2xxx/qla_def.h > +++ b/drivers/scsi/qla2xxx/qla_def.h > @@ -532,6 +532,8 @@ typedef struct srb { > uint8_t cmd_type; > uint8_t pad[3]; > atomic_t ref_count; > + struct kref cmd_kref; /* need to migrate ref_count over to this */ > + void *priv; > wait_queue_head_t nvme_ls_waitq; > struct fc_port *fcport; > struct scsi_qla_host *vha; My feedback about this patch is as follows: - I think that having two atomic counters in struct srb with a very similar purpose is overkill and also that a single atomic counter should be sufficient in this structure. - The problem fixed by this patch not only can happen with NVMe-FC but also with FC. I would prefer to see all code paths fixed for which a race between completion and abort can occur. Yes. We are in process of doing the larger cleanup. However, this patch was part of fixes we verified for a crash and want to get this in a distro before the wider cleanup is submitted for inclusion. Would you consider allowing us to add this patch and we'll provide larger patch fixing all code path in next series. Thanks, Bart.