RE: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after shutdown/unload

Sreekanth Reddy <sreekanth.reddy@xxxxxxxxxxxx> · Thu, 15 Feb 2018 11:18:21 +0530

Mauricio,

During the shutdown time, I don't want the outstanding IOs to timeout due to
disabling of interrupts and go the TM path. So I wanted to clear out all the
Outstanding IOs in the shutdown path itself instead of clearing them in TM
path.

But if there are already some TMs are outstanding while initiating the
shutdown operation then your patch looks good to me.

May be can you include my set of changes along with your solution?

Thanks,
Sreekanth

-----Original Message-----
From: Mauricio Faria de Oliveira [mailto:mauricfo@xxxxxxxxxxxxxxxxxx]
Sent: Wednesday, February 14, 2018 9:05 PM
To: Sreekanth Reddy; linux-scsi@xxxxxxxxxxxxxxx; Bart.VanAssche@xxxxxxx
Cc: Sathya Prakash Veerichetty; Chaitra Basappa; Suganath Prabu Subramani;
jejb@xxxxxxxxxxxxxxxxxx; martin.petersen@xxxxxxxxxx;
dougmill@xxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH v2] scsi: mpt3sas: fix oops in error handlers after
shutdown/unload

Hi Sreenkanth,

Thanks for reviewing.

On 02/12/2018 04:28 AM, Sreekanth Reddy wrote:
> Instead of returning the scmd with DID_NO_CONNECT in TM path, can we
> wait for some time (10 seconds) in shutdown/unload path for the
> outstanding commands to complete and even then [if] the scmds are
> outstanding then return all the outstanding scmds with DID_NO_CONNECT
> in the shutdown/unload path itself as shown below,

Apparently there is a problem with that.

That is not enough for TM commands; it's only for SCSI IO commands.

The TM commands use the high-priority queue, and SCSI IO commands use the
SCSI IO queue.  But this patch only waits for and return commands in the
latter:

 > +	mpt3sas_wait_for_commands_to_complete(ioc);
 > +	_scsih_flush_running_cmds(ioc);

They only loop up until 'ioc->scsiio_depth', but TM commands get SMIDs above
that [see mpt3sas_base_free_smid() and mpt3sas_scsih_issue_tm()].

So, there's an exposure for abort/reset/other TM commands. For example, the
SCSI IO commands finish quickly in wait_for_commands_to_complete(), then the
shutdown function continues, disables interrupts, and release
pointers/memory.  Then, an outstanding abort/reset command is finished, its
completion interrupt is not serviced, and the SCSI EH for it kicks
in/continues, and so it escalates up, and hit that oops in host reset.

Is there any problem with the patch that I proposed?

It seems okay to have another check in error path (not hot/performance
path), and that check is already used in many other points in the code, for
the same reasons (exit early before the code attempts to use stuff that
might be released).

Thanks again,

--
Mauricio Faria de Oliveira
IBM Linux Technology Center