On 11/19/2012 7:41 AM, Hannes Reinecke wrote:
> Hi all,
>
> just when we thought we'd finally nailed the error handling on FC ...
> A customer of ours recently hit this really nasty issue:
> He had a 'drain' on the SAN, in the sense that the link was still
> intact, but no commands were coming back from it.
> This caused the FC HBA / driver to not detect a link down, so the
> failing command was pushed onto the error handler, which of course
> eventually escalated to an HBA reset - but by that time the cluster
> had already kicked the machine out.
> And as all machines in the cluster were connected to the same switch,
> this happened on all of them, resulting in a nice cluster shutdown.
> And a really unhappy customer.
>
> Looking closer, multipathing actually managed to detect and switch
> paths as desired, but as the initial failing command was pushed onto
> the error handler, all applications had to wait for this command to
> finish before proceeding.
>
> So the following questions:
>
> - Why did the FC HBA not detect a 'link-down' scenario?
>   (Incidentally, this happens with QLogic _and_ Emulex :-)
>   I know this is not a typical link-down, but my naive
>   assumption is that the HBA should detect that commands are not
>   making progress, and at least after R_A_TOV has expired
>   it should try to reset the link.
Link up/down is almost always the state of the physical link - TX signal
is active, and on the RX side we have negotiated speed, acquired sync,
and are seeing valid characters. It has nothing to do with packet
transmission on the link, which is a different story. There is, within
the FC standard, tracking of credits on the link, which could reset it
(although its notion of 'reset' may differ from yours). So as long as
the other end kept its link up and we saw valid characters, the link is
fine.
From the SCSI perspective there are no requirements on how long a
command may take (consider format commands, which could take hours
between the command and the response). There is no definition of
"making progress" that can be enforced. We have the I/O timers, which
usually default to 30s/60s/90s. R_A_TOV (10s) is too short compared
with these - especially when considering transparent-failover arrays:
two pieces of hardware, both on the link, but only one responding.
After one fails, the other takes over its partner's personality, taking
about 90s to do so, and then resumes the I/Os from the new hardware;
for much of that window there may be no traffic at all, and the link is
still "good". Additionally, there is no requirement that all targets be
in use at all times - you could come up with a situation where one idle
target influences the link-activity decision, invoking the link bounce
and disrupting I/O load on other targets that are fine. Low
probability, but possible.

In general, lack of activity is a good indicator, but that's it - only
an indicator, not great for a hard policy choice. You're also asking
low-level designs to do something new (time inter-I/O gaps and
aggregate them), which they may not be prepared to do.
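To make the false-positive risk concrete, here is a minimal sketch of what such an inter-I/O-gap heuristic might look like. All names (`TargetState`, `link_suspect`) are invented for illustration and do not come from any real driver:

```python
from dataclasses import dataclass

@dataclass
class TargetState:
    """Hypothetical per-target bookkeeping; names invented here,
    not taken from any real HBA driver."""
    last_completion: float  # time of the last completed command (s)
    outstanding: int        # commands currently in flight

def link_suspect(t: TargetState, now: float, threshold: float) -> bool:
    """Suspect the link only if commands are outstanding AND nothing
    has completed for longer than `threshold`. An idle target must
    never count as evidence for a link bounce."""
    if t.outstanding == 0:
        return False            # no traffic expected: not evidence
    return now - t.last_completion > threshold

# An idle target can stay silent forever without triggering anything:
assert not link_suspect(TargetState(last_completion=0, outstanding=0), 200, 30)
# A busy target with a recent completion is fine:
assert not link_suspect(TargetState(last_completion=10, outstanding=4), 20, 30)
# A busy target inside a 90s transparent-failover window trips a 30s
# threshold -- exactly the false positive described above:
assert link_suspect(TargetState(last_completion=10, outstanding=4), 110, 30)
```

Note the threshold has to be chosen against the worst legitimate silence (the 90s failover window above), which is what makes a hard policy out of this indicator so awkward.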
> - Can we speed up error handling for these cases?
>   Currently we're waiting for eh to complete before returning
>   the affected commands with a final state.
>   However, after we've done a LUN reset there shouldn't be
>   any command state left, and we should be able to terminate
>   outstanding commands directly, without having to wait for
>   eh to finally complete. James?
Theoretically, I agree - the affected command only has to stall long
enough to ensure its own cancellation, which could be just the I/O
abort. True, if the abort is not successful, then you still don't know
the status, so you have to escalate the type of recovery to try to
cancel, and so on. Given the limbo state of the I/O when the lower eh
steps fail, I expect you do have to wait to ensure it's cancelled, at
least from a generic SCSI point of view. You could try to optimize the
local system view: as long as the LLDD ensures the command is cancelled
and, protocol-wise, will have no bad side effects, you could release it
earlier in the eh escalation. I don't believe we have a way for the
LLDD to give such a notice to the midlayer. Given all the grey areas
you touch on, especially across different types of SCSI protocols and
hardware, it doesn't surprise me that we wait for confirmation of
cancellation before continuing.
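The escalation being discussed can be sketched as a toy model. The ladder below mirrors the usual SCSI eh sequence (abort, then LUN/target/bus/host reset); `lldd_confirms_cancelled` is a hypothetical LLDD-to-midlayer notification that does not exist today, which is exactly the point above:

```python
# Toy model of the SCSI eh escalation ladder. `lldd_confirms_cancelled`
# is an invented hook showing how an LLDD guarantee could let a command
# be released before the full ladder completes.

ESCALATION = ["abort", "lun_reset", "target_reset", "bus_reset", "host_reset"]

def recover(step_succeeds, lldd_confirms_cancelled=None):
    """Walk the ladder until a step succeeds; return (steps_run,
    released_early). `step_succeeds(step)` says whether a recovery step
    worked; `lldd_confirms_cancelled(step)` optionally reports that the
    LLDD guarantees the command is gone after that step."""
    steps_run = []
    for step in ESCALATION:
        steps_run.append(step)
        ok = step_succeeds(step)
        if lldd_confirms_cancelled and lldd_confirms_cancelled(step):
            return steps_run, True   # release the command early
        if ok:
            return steps_run, False  # step worked; command returned now
    return steps_run, False          # ladder exhausted

# Today: the abort fails, the LUN reset succeeds, and the command is
# held across both steps before it is returned.
steps, early = recover(lambda s: s == "lun_reset")
assert steps == ["abort", "lun_reset"] and not early

# With the hypothetical notice: the LLDD confirms cancellation right
# after the LUN reset, so the command could be released at that point
# even though the step itself did not report success.
steps, early = recover(lambda s: False, lambda s: s == "lun_reset")
assert steps == ["abort", "lun_reset"] and early
```

The model makes the trade-off visible: without the hook, correctness requires holding the command until some step confirms; with it, the wait shrinks to whatever step the LLDD can vouch for.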
Given that path switching is somewhat separate from the I/O, would it
be better to send a notification of a path-fail condition as part of
the eh, rather than hinging it on the individual I/O? Yes, that I/O is
still in limbo and can't be switched to the new path, but other I/O
could be, without incurring the delay.
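A minimal sketch of that suggestion, with an invented `notify_path_fail` event feeding a toy multipath layer (none of these names come from dm-multipath):

```python
# Toy model: eh emits a path-fail event as soon as recovery starts, so
# the multipath layer can fail over new I/O immediately, while only the
# command stuck in eh keeps waiting. `Multipath` and `notify_path_fail`
# are invented for illustration; this is not the dm-multipath API.

class Multipath:
    def __init__(self, paths):
        self.paths = list(paths)     # preference-ordered path list
        self.failed = set()

    def notify_path_fail(self, path):
        """Event the eh would raise at the start of recovery."""
        self.failed.add(path)

    def pick_path(self):
        """Route new I/O to the first path not marked failed."""
        for p in self.paths:
            if p not in self.failed:
                return p
        return None                  # all paths down

mp = Multipath(["sda", "sdb"])
assert mp.pick_path() == "sda"

# eh begins recovering the command stuck on sda and signals the path
# failure immediately, before the escalation finishes:
mp.notify_path_fail("sda")

# New I/O switches right away instead of waiting for eh completion:
assert mp.pick_path() == "sdb"
```

In this model only the in-limbo command pays the full eh latency; everything queued afterwards takes the surviving path at once.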
-- james s
> Thanks.
>
> Cheers,
> Hannes