RE: [PATCH] scsi: Allow error handling timeout to be specified

"Elliott, Robert (Server Storage)" <Elliott@xxxxxx> · Mon, 13 May 2013 15:16:03 +0000

> -----Original Message-----
> From: linux-scsi-owner@xxxxxxxxxxxxxxx [mailto:linux-scsi-owner@xxxxxxxxxxxxxxx] On Behalf Of Ewan Milne
> Sent: Friday, 10 May, 2013 11:59 AM
> To: Hannes Reinecke
> Cc: Baruch Even; Martin K. Petersen; linux-scsi; michaelc
> Subject: Re: [PATCH] scsi: Allow error handling timeout to be specified
> 
> On Fri, 2013-05-10 at 16:24 +0200, Hannes Reinecke wrote:
> > On 05/10/2013 04:01 PM, Ewan Milne wrote:
> > > On Fri, 2013-05-10 at 16:22 +0300, Baruch Even wrote:
> > >> On Fri, May 10, 2013 at 3:43 PM, Ewan Milne <emilne@xxxxxxxxxx> wrote:
> > >>
> > >>
> > >> I would argue that waiting for the eh to timeout before you switch to
> > >> another path is most likely to be wrong. If you did the first pass of
> > >> error recovery (task abort) and that failed the
> > >> path/hba/logical-device is doomed. If you will switch to another path
> > >> it will either work (meaning the path/hba were bad) or not (logical
> > >> device was the culprit).
> > >
> > > It is necessary to either know the disposition of a command or
> > > else wait for a defined amount of time before retrying the command on
> > > another path.  Otherwise you run the risk that the command will
> > > eventually complete on the first path.  So yes, we need to do the abort
> > > (and its timeout).
> > >
> > Strictly speaking that's not true.
> > Yes, we do need to wait for a certain amount of time for the command
> > completion to come in.
> >
> > However, this time is only defined _on the initiator_.
> > The specification does _NOT_ have any fixed timeout values for _any_
> > command. As such it could in theory (and does, if you happen to run
> > against certain arrays under certain conditions) take several
> > minutes to return a completion.

The REPORT SUPPORTED OPERATION CODES command (see SPC-4) 
returns nominal and recommended timeout values for each supported
command.  Similarly, REPORT SUPPORTED TASK MANAGEMENT FUNCTIONS
returns timeouts for task management functions.

Those times are from the device server's perspective, so any fabric 
overhead needs to be added.

Those commands and the command timeout descriptors are optional.
They are proposed to be mandatory in the Base feature set, though.
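
As a rough illustration of how an initiator might use those values, here
is a minimal sketch in Python. It assumes the 12-byte command timeouts
descriptor layout as I recall it from SPC-4 (a two-byte DESCRIPTOR LENGTH
of 0Ah, a reserved byte, a COMMAND SPECIFIC byte, then four-byte NOMINAL
COMMAND PROCESSING TIMEOUT and RECOMMENDED COMMAND TIMEOUT fields, both in
seconds, with zero meaning "no value reported"); check the standard before
relying on it. The function names, the fabric overhead parameter, and the
30-second fallback are hypothetical and only there for the example.

```python
import struct

def parse_command_timeouts_descriptor(desc: bytes):
    """Return (nominal_s, recommended_s) from a command timeouts descriptor.

    Assumed layout (big-endian): length(2) reserved(1) command_specific(1)
    nominal(4) recommended(4).
    """
    if len(desc) < 12:
        raise ValueError("command timeouts descriptor is shorter than 12 bytes")
    length, _reserved, _cmd_specific, nominal, recommended = struct.unpack(
        ">HBBII", desc[:12]
    )
    if length != 0x0A:
        raise ValueError("unexpected DESCRIPTOR LENGTH: %#x" % length)
    return nominal, recommended

def initiator_timeout(recommended_s: int, fabric_overhead_s: int,
                      fallback_s: int = 30) -> int:
    """Initiator-side timeout: device server value plus fabric overhead.

    fallback_s is a hypothetical default used when the device server
    reports zero (no recommendation).
    """
    base = recommended_s if recommended_s else fallback_s
    return base + fabric_overhead_s
```

For example, a descriptor reporting a recommended timeout of 60 seconds
with an assumed 5 seconds of fabric overhead gives an initiator timeout of
65 seconds.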

> Granted.  (e.g. in the case of WRITE SAME, it could be a while before
> the command completes, and retrying it on another path too quickly,
> followed by other WRITE commands could be a disaster).  So the timeout
> used for the original command has to be appropriate for the command.
> Reducing that timeout and issuing an abort / lun reset / target reset
> to try to fail over to another path earlier won't work if the device
> never gets the abort / lun reset / target reset and the command is still
> executing.

One problem with the ABORT TASK and I_T NEXUS RESET task management
functions is that they must be sent down the same I_T nexus as the
command(s) that ran into timeouts.  If that I_T nexus is the source of the
problem, then they are likely to time out as well.

The REMOVE I_T NEXUS command (standardized in March 2012 in 
SPC-4 revision 35) is designed to be sent down a different I_T nexus - the 
failover path.  It ensures that commands on the original I_T nexus won't 
suddenly resume.  That command is optional and still very new in
standards time.
