Re: Error handling on FC devices

On 11/19/2012 7:41 AM, Hannes Reinecke wrote:
> Hi all,
>
> just when we thought we'd finally nailed the error handling on FC ...
> A customer of ours recently hit this really nasty issue:
> He had a 'drain' on the SAN, in the sense that the link was still
> intact, but no commands were coming back from the link.
>
> This caused the FC HBA / driver not to detect a link down, so the
> failing command was pushed onto the error handler, which of course
> eventually escalated to an HBA reset. But by that time the cluster
> had already kicked the machine out. And as all machines in the
> cluster were connected to the same switch, this happened to every
> machine, resulting in a nice cluster shutdown. And a really unhappy
> customer.
>
> Looking closer, multipathing actually managed to detect and switch
> paths as desired, but as the initial failing command was pushed onto
> the error handler, all applications had to wait for this command to
> finish before proceeding.
>
> So the following questions:
> - Why did the FC HBA not detect a 'link-down' scenario?
>   (Incidentally, this happens with QLogic _and_ Emulex :-)
>   I know this is not a typical link-down, but from my naive
>   assumption the HBA should detect that commands are not
>   making progress, and at least after R_A_TOV has expired
>   it should try to reset the link.

Link up/down is almost always the state of the physical link - TX
signal is active, and on the RX side we have negotiated speed,
acquired sync, and are seeing valid characters. It has nothing to do
with packet transmission on the link, which is a different story.
There is, within the FC standard, tracking of credits on the link,
which could reset it (although that is a credit reset, which may be a
different reset from the one you have in mind). So as long as the
other end kept its link up and we saw valid characters, the link is
fine.
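
For what it's worth, the fc transport class already models exactly
this physical-only notion of link state. Abbreviated from
include/scsi/scsi_transport_fc.h (most administrative states elided):

enum fc_port_state {
        FC_PORTSTATE_UNKNOWN,
        FC_PORTSTATE_ONLINE,    /* sync acquired, seeing valid characters */
        FC_PORTSTATE_LINKDOWN,  /* loss of signal or sync */
        /* ... */
};

Note the states describe signal and sync, not whether any exchange
on the port is making progress.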

From the SCSI perspective there are no requirements about how long a
command takes (consider format commands, which can take hours between
the command and the response). There is no definition of "making
progress" that can be enforced. We have the i/o timers, which usually
default to 30s/60s/90s. R_A_TOV (10s) is far too short compared to
these - especially when considering some transparent-failover arrays:
two pieces of hardware, both on the link, but only one responding;
after one fails, the other takes over its personality, taking about
90s to do so, and then resumes the i/os from the new hardware. During
that entire window there may be no traffic at all, and the link is
still "good". Additionally, there is no requirement that all targets
be in use at all times - you could end up in a situation where one
target is influencing the link-activity decision, thus invoking the
link bounce and disrupting the i/o load on other targets that are
fine. Low probability, but possible.
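
For reference, those i/o timers are the block-layer per-request
timeout that the ULDs set on each device; sd's default is 30s. A
minimal sketch of where such a value gets applied - illustrative
only, not a proposed change:

#include <linux/blkdev.h>
#include <linux/jiffies.h>
#include <scsi/scsi_device.h>

#define EXAMPLE_CMD_TIMEOUT     (30 * HZ)       /* sd's usual default */

/* When this per-request timer fires, the block layer calls into
 * scsi_times_out() and the command is handed to eh.  Note how far
 * above R_A_TOV (10s) it sits, for the reasons given above.
 */
static void example_set_io_timer(struct scsi_device *sdev)
{
        blk_queue_rq_timeout(sdev->request_queue, EXAMPLE_CMD_TIMEOUT);
}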

In general, lack of activity is a good indicator, but that's it -
only an indicator, not a great basis for a hard policy choice. Also,
you're asking low-level designs to do something new (time inter-i/o
gaps and aggregate them), which they may not be prepared to do.
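
To make concrete what that new bookkeeping would be, here is a purely
hypothetical sketch - no driver carries these fields today, and every
name below is invented for illustration:

#include <linux/jiffies.h>
#include <linux/types.h>

struct hypothetical_link_stats {
        unsigned long last_completion;  /* jiffies of last good response */
        unsigned long outstanding;      /* commands in flight on the link */
};

/* True only if commands are queued yet nothing has completed for
 * longer than the threshold.  Even then it is just an indicator -
 * see the format-command and transparent-failover caveats above.
 */
static bool hypothetical_link_stalled(struct hypothetical_link_stats *s,
                                      unsigned long threshold)
{
        return s->outstanding &&
               time_after(jiffies, s->last_completion + threshold);
}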

> - Can we speed up error handling for these cases?
>   Currently we're waiting for eh to complete before returning
>   the affected commands with a final state.
>   However, after we've done a LUN reset there shouldn't be
>   any command state left and we should be able to terminate
>   outstanding commands directly, without having to wait for
>   eh to finally complete. James?

Theoretically, I agree - the affected command only has to stall long
enough to ensure its own cancellation, which could be just the i/o
abort. True, if the abort is not successful you still don't know the
status, so you have to escalate the type of recovery to try to cancel
it, and so on. I expect, given the limbo state of the i/o when the
lower eh steps fail, you do have to wait to ensure it's "cancelled",
at least from a generic SCSI point of view. You could try to optimize
the local system view: as long as the LLDD ensures the command is
cancelled, and will protocol-wise ensure no bad side effects, you
could release it earlier in the eh escalation. I don't believe we
have a way for the LLDD to give such a notice to the midlayer. Given
all the grey areas you touch on, especially across different types of
SCSI protocols and hardware, it doesn't surprise me that we wait
until we have confirmation of cancellation before continuing.
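
Purely to illustrate the kind of notice that is missing, a
hypothetical sketch of such an interface - nothing like this exists
in the midlayer today, and the names and semantics are invented:

#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>

enum hypothetical_cancel_state {
        LLDD_CMD_LIMBO,         /* status unknown, must wait for eh */
        LLDD_CMD_CANCELLED,     /* provably dead, no side effects possible */
};

/* Imagined midlayer entry point, called from the LLDD's eh path
 * once it can guarantee the command will never complete or have
 * protocol side effects.
 */
void scsi_eh_early_release(struct scsi_cmnd *scmd,
                           enum hypothetical_cancel_state state)
{
        if (state != LLDD_CMD_CANCELLED)
                return;                 /* keep waiting for eh */

        scmd->result = DID_TRANSPORT_FAILFAST << 16;
        scmd->scsi_done(scmd);          /* hand back to upper layers early */
}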

Given that path switching is somewhat separate from the i/o, would it
be better to send a notification of a path-fail condition as part of
eh, rather than hinging it on the individual i/o? Yes, that i/o is
still in limbo and can't be switched to the new path, but other i/o
could be, without incurring the delay.
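
Again only a sketch of the idea, with an invented hook and an event
multipathd would have to be taught to listen for - raised when eh
first claims a command on the device, not when the command finally
completes:

#include <linux/kobject.h>
#include <scsi/scsi_device.h>

/* Invented: called at eh entry rather than at command completion. */
static void hypothetical_notify_path_fail(struct scsi_device *sdev)
{
        sdev_printk(KERN_WARNING, sdev,
                    "eh invoked, flagging path for multipath\n");
        /* A change uevent multipathd could subscribe to. */
        kobject_uevent(&sdev->sdev_gendev.kobj, KOBJ_CHANGE);
}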

-- james s


> Thanks.
>
> Cheers,
>
> Hannes
