On 13-03-01 10:27 AM, Jeremy Linton wrote:
On 3/1/2013 9:06 AM, James Bottomley wrote:
The results were "interesting": some really strange things happen in some
of the LLD error paths. It's obvious that error injection is not part of
testing for many of them, and what at first glance should be a fairly
straightforward error can create quite a mess. So anyone sending
any kind of reset (especially without the ESCALATE flag which tends to
isolate the error handling) to the LLDs should be aware that behavior
can vary significantly between them.
So the patch does seem to have dangerous side effects.
Those are due to "bugs" in the LLDs that are there regardless of that
patch. For example, the lpfc patch I posted a couple of days ago fixes the
lpfc driver so that it actually checks the return status from the task
management IOCBs being sent to the firmware. As it stands, the reset paths in
the lpfc driver always return SUCCESS, independent of the status of any
aborts or resets sent as part of the reset handlers. This is completely
non-obvious at first glance at the code.
This means that the error handling behavior of lpfc is significantly
different from (and not necessarily better than) that of the zfcp and qlogic
drivers I also tested.
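To make the consequence of the always-SUCCESS reset path concrete, here is a toy model in Python. It is not lpfc code and not the mid-layer's error handler: the function names, the dictionary layout and the "narrowest reset first" loop are assumptions made for illustration. Only the SUCCESS/FAILED values (from include/scsi/scsi.h) and the device, target, bus, host ordering are taken from the kernel.

```python
# Toy model only: not lpfc code and not the kernel's error handler.
SUCCESS = 0x2002  # SUCCESS/FAILED values as in include/scsi/scsi.h
FAILED = 0x2003

LEVELS = ["device", "target", "bus", "host"]  # narrowest reset first


def handler_ignoring_status(send_tmf):
    """Hypothetical reset handler that discards the task management result."""
    send_tmf()
    return SUCCESS


def handler_checking_status(send_tmf):
    """Hypothetical reset handler that propagates the task management result."""
    return SUCCESS if send_tmf() else FAILED


def escalate(handlers, tmf_results):
    """Walk the reset levels until a handler reports SUCCESS.

    handlers maps a level name to a handler function.
    tmf_results maps a level name to True/False: whether the firmware
    really completed the task management request at that level.
    Returns (level that reported SUCCESS, whether it really worked).
    """
    for level in LEVELS:
        status = handlers[level](lambda: tmf_results[level])
        if status == SUCCESS:
            return level, tmf_results[level]
    return None, False


# The device (LU) reset fails in firmware; wider resets would work.
results = {"device": False, "target": True, "bus": True, "host": True}
ignoring = {level: handler_ignoring_status for level in LEVELS}
checking = {level: handler_checking_status for level in LEVELS}

print(escalate(ignoring, results))  # ('device', False)
print(escalate(checking, results))  # ('target', True)
```

With the status discarded, the walk stops at the device level even though nothing was actually reset. With the status propagated, the failure is visible and the next level gets a chance.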
I didn't find any cases where this patch makes the problem worse; in fact,
in general the behavior is significantly better.
My testing of this patch was against scsi_debug and SAS.
It was relatively simple with scsi_debug and did what
was advertised.
SAS was much more difficult with my LSI controllers and an
expander. I was trying to set up a situation where Linux
thought there was a LU present but a phy to it in the expander
was disabled, breaking the path. These days broadcast(change)
works too well to get away with that. The next attempt used
SAS zoning with two initiators, blind-siding one initiator's
path to a LU via SAS zoning functions sent from the other
initiator. That works, but when I issued the LU resets
(non-escalating or the existing escalating) strange things
happened in the LSI mptsas (first generation) LLD. I found
myself in a similar situation to Jeremy with his testing:
I'm certain the reset was being issued and failing
but the resulting mess was caused by the mptsas LLD **. I
don't have the time or equipment to delve into that LLD. And I
suspect that that LLD is bypassing mid-level error handling
to do its own.
Mike Christie had doubts about this patch as well but I hope
that I convinced him (via posts to this list) that there
wasn't a problem. All that is happening is that additional,
non-escalating versions of the existing user space reset options
are being added.
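For readers who want to see what such a user space request might look like, here is a minimal Python sketch using the sg driver's SG_SCSI_RESET ioctl. The ioctl number and the reset level values are those in include/scsi/sg.h. The SG_SCSI_RESET_NO_ESCALATE name and its 0x100 value are an assumption taken from later mainline headers and may not match the patch as posted. The helper names and the device path are hypothetical.

```python
import fcntl
import os
import struct

# ioctl number and reset levels as in include/scsi/sg.h
SG_SCSI_RESET = 0x2284
SG_SCSI_RESET_DEVICE = 1
SG_SCSI_RESET_BUS = 2
SG_SCSI_RESET_HOST = 3
SG_SCSI_RESET_TARGET = 4
# Assumption: name and value from later mainline kernels, not from this patch.
SG_SCSI_RESET_NO_ESCALATE = 0x100


def reset_arg(level, escalate=True):
    """Build the integer passed to SG_SCSI_RESET (hypothetical helper)."""
    return level if escalate else level | SG_SCSI_RESET_NO_ESCALATE


def issue_reset(dev_path, level, escalate=True):
    """Send the reset request; raises OSError if the ioctl fails.

    Needs a real SCSI device node and sufficient privileges.
    """
    fd = os.open(dev_path, os.O_RDWR | os.O_NONBLOCK)
    try:
        # The ioctl reads an int through the pointer it is given.
        fcntl.ioctl(fd, SG_SCSI_RESET, struct.pack("i", reset_arg(level, escalate)))
    finally:
        os.close(fd)


print(hex(reset_arg(SG_SCSI_RESET_DEVICE, escalate=False)))  # 0x101
```

Under these assumptions, a call such as issue_reset("/dev/sg1", SG_SCSI_RESET_DEVICE, escalate=False) would request a LU reset only, and a failure would be reported to the caller instead of being escalated to a target, bus or host reset.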
The bottom line is that when escalating device (LU) and target
(I_T Nexus) resets are issued on modern transports you can
never be 100% sure that they will get through (e.g. due to
congestion). And escalating that reset to the next level
could cause significant collateral damage.
Doug Gilbert
** And the HBA was never officially sold by LSI (IBM sold it)
so the firmware is pretty old (as in 4 years old).