On Tue, 2018-10-16 at 11:23 -0400, Bill Kuzeja wrote: > When doing a surprise removal of an adapter, some in flight I/Os can > get > stuck and take a while to complete (they actually timeout and are > retried). We are not handling an early error exit from > qla2xxx_eh_abort properly. > > Fixes: 45235022da99 ("scsi: qla2xxx: Fix driver unload by shutting > down chip") > Signed-off-by: Bill Kuzeja <william.kuzeja@xxxxxxxxxxx> > --- > > (Originally sent on September 26th, no reply so resending.) > > After a hot remove of a Qlogic adapter, the driver's remove function > gets > called and we end up aborting all in progress I/Os. Here is the code > flow: > > qla2x00_remove_one > qla2x00_abort_isp_cleanup > qla2x00_abort_all_cmds > __qla2x00_abort_all_cmds > qla2xxx_eh_abort > > At the start of qla2xxx_eh_abort, some sanity checks are done before > actually sending the abort. One of these checks is a call to > fc_block_scsi_eh. In the case of a hot remove, it turns out that > this > routine can exit with FAST_IO_FAIL. > > When this occurs, we return back to __qla2x00_abort_all_cmds with an > extra reference on sp (because the abort never gets sent). > Originally, > this was addressed with another fix: > > commit 4cd3b6ebff85 scsi: qla2xxx: Fix extraneous ref on sp's after > adapter break > > But this later this added change complicated matters: > > commit 45235022da99 scsi: qla2xxx: Fix driver unload by shutting down > chip > > Because the abort is now being done earlier in the teardown (through > qla2x00_abort_isp_cleanup), in qla2xxx_eh_abort we make it past > the first check because qla2x00_isp_reg_stat(ha) returns zero. When > we > fail a few lines later in fc_block_scsi_eh, this error is not handled > properly in __qla2x00_abort_all_cmds and the I/O ends up hanging and > timing out because of the extra reference. > > For this fix, a check for FAST_IO_FAIL is added to > __qla2x00_abort_all_cmds where we check to see if qla2xxx_eh_abort > succeeded or not. > > This removes the extra reference in this additional early exit case. > In > my testing (hw surprise removals and also adapter remove via sysfs), > this > eliminates the timeouts and delays and the remove proceeds smoothly. > > --- > drivers/scsi/qla2xxx/qla_os.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_os.c > b/drivers/scsi/qla2xxx/qla_os.c > index 42b8f0d..3ba3765 100644 > --- a/drivers/scsi/qla2xxx/qla_os.c > +++ b/drivers/scsi/qla2xxx/qla_os.c > @@ -1771,8 +1771,9 @@ uint32_t qla2x00_isp_reg_stat(struct > qla_hw_data *ha) > * if immediate exit from > * ql2xxx_eh_abort > */ > - if (status == FAILED && > - (qla2x00_isp_reg_stat(ha > ))) > + if (((status == FAILED) && > + (qla2x00_isp_reg_stat(ha > ))) || > + (status == > FAST_IO_FAIL)) > atomic_dec( > &sp->ref_count); > } I recently bumped into this issue. The fix looks correct to me. Reviewed-by Laurence Oberman <loberman@xxxxxxxxxx>