> On May 25, 2017, at 12:26 PM, Bill Kuzeja <william.kuzeja@xxxxxxxxxxx> wrote: > > Hung task timeouts can result if a qlogic board breaks unexpectedly while > running I/O. These tasks become hung because command srb reference counts > are not going to zero, hence the affected srbs and commands do not get > freed. This fix accounts for this extra reference in the srbs in the case > of a board failure. > > Fixes: a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect") > Signed-off-by: Bill Kuzeja <william.kuzeja@xxxxxxxxxxx> > --- > > I encountered this hang during by adapter break testing (on Stratus > hardware, we need to survive adapter breaks). I noticed that we wait > indefinitely for several commands which all have outstanding > references to them, but which have all followed this code path: > > qla2x00_abort_all_cmds > qla2xxx_eh_abort > Exit early because qla2x00_isp_reg_stat is non-zero. > > Note that before calling qla2xxx_eh_abort from this path, we do an extra > sp_get(sp), which takes out another reference on the current sp. If we > do not early exit immediately from qla2xxx_eh_abort, we'll get rid of this > extra reference through the abort - which is the normal case. > > However, if we exit immediately, this extra sp_get is never dereferenced. > Each one of the commands flowing through this code will be stuck forever, > resulting in hung tasks. > > This early exit in qla2xxx_eh_abort was introduced by this commit: > > commit a465537ad1a4 ("qla2xxx: Disable the adapter and skip error recovery in case of register disconnect.") > > The solution for this is somewhat tricky because qla2xxx_eh_abort can be > invoked from the scsi error handler as well as qla2x00_abort_all_cmds. > I originally thought of having qla2xxx_eh_abort remove the extra reference > before early exiting, but not knowing the context of its invocation, > this seemed dangerous. > > Alternatively we could also just remove the early exit case from > qla2xxx_eh_abort, but the aforementioned commit was put in for a reason, so > that doesn't seem correct either. > > The right thing to do seems to be putting the fix in qla2x00_abort_all_cmds, > checking the conditions we where we have exited early from qla2xxx_eh_abort > before removing the extra reference. > > --- > drivers/scsi/qla2xxx/qla_os.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c > index 1c79579..1a93365 100644 > --- a/drivers/scsi/qla2xxx/qla_os.c > +++ b/drivers/scsi/qla2xxx/qla_os.c > @@ -1632,7 +1632,7 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) > void > qla2x00_abort_all_cmds(scsi_qla_host_t *vha, int res) > { > - int que, cnt; > + int que, cnt, status; > unsigned long flags; > srb_t *sp; > struct qla_hw_data *ha = vha->hw; > @@ -1662,8 +1662,12 @@ uint32_t qla2x00_isp_reg_stat(struct qla_hw_data *ha) > */ > sp_get(sp); > spin_unlock_irqrestore(&ha->hardware_lock, flags); > - qla2xxx_eh_abort(GET_CMD_SP(sp)); > + status = qla2xxx_eh_abort(GET_CMD_SP(sp)); > spin_lock_irqsave(&ha->hardware_lock, flags); > + /* Get rid of extra reference if immediate exit > + * from ql2xxx_eh_abort */ > + if (status == FAILED && (qla2x00_isp_reg_stat(ha))) > + atomic_dec(&sp->ref_count); > } > req->outstanding_cmds[cnt] = NULL; > sp->done(sp, res); > -- > 1.8.3.1 > Looks Good. Acked-by: Himanshu Madhani <himanshu.madhani@xxxxxxxxxx> Thanks, - Himanshu