Re: [PATCHv2 0/7] Limit overall SCSI EH runtime

Ewan Milne <emilne@xxxxxxxxxx> · Fri, 12 Jul 2013 09:30:40 -0400

On Fri, 2013-07-12 at 13:54 +0800, Ren Mingxin wrote:
> Hi, Ewan:
> 
> I'm wondering how you tested this: with special hardware or a
> self-made module? Would you mind posting your test method and results?

Hi Ren-

This was tested in a SAN environment with an EMC Symmetrix and
Brocade FC switches.  The error was injected by the following
commands:

portcfg rscnsupr <port> --enable
portdisable <port>

Where <port> is the FC port of the Symmetrix target.

Multipath is used, and the test records how long I/O from userspace
takes to complete.  The I/O completes only once error handling has
stopped and the I/O has been retried on another path.
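
For illustration only (this is not the program used in the testing
described here), a userspace measurement of this kind could be
sketched as below.  It assumes a scratch file on a filesystem backed
by the multipath device; the function names, the 4 KiB payload, and
the one-second interval are all hypothetical choices.

```python
import os
import time


def timed_write(path, payload=b"\0" * 4096):
    """Write payload to path, flush it to the device, return elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # blocks until the kernel reports the data as flushed
    finally:
        os.close(fd)
    return time.monotonic() - start


def worst_latency(path, count=10, interval=1.0):
    """Issue count writes, interval seconds apart, and return the slowest one."""
    worst = 0.0
    for _ in range(count):
        worst = max(worst, timed_write(path))
        time.sleep(interval)
    return worst
```

Running worst_latency() across the moment the target port is disabled
would give a rough figure for how long a single write was held up
before it completed on the other path.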

What happens is that the target never responds to anything the
HBA sends, so commands and TMFs just time out.  The HBA doesn't
see link down (since it is the target port) and doesn't get an
RSCN.  When the HBA is finally reset, however, it can't log in
to the target port and so further I/O gets an immediate error.

Unfortunately, not all SAN environments will exhibit the failing
behavior -- in some cases the HBA appears to detect the problem
regardless of the switch portcfg setting.  But the patchset has
been verified to solve the problem of seemingly endless EH
activity in testing at a large customer site.

Also, to be clear, we tested with the "Limit overall SCSI EH
runtime" patchset but not the "New EH command timeout handler".
I think the changes to issue the abort in the timeout handler
are a good idea, though, because there really is no need to
wait for all activity on the host to cease before issuing the
abort as far as I can see.

-Ewan

> 
> Thanks,
> Ren
> 
> >
> > Acked-by: Ewan D. Milne <emilne@xxxxxxxxxx>
> 
