Error handling on FC devices

Hi all,

just when we thought we'd finally nailed the error handling on FC ...
A customer of ours recently hit this really nasty issue:
He had a 'drain' on the SAN: the link itself stayed up, but no commands were coming back over it.

As a result the FC HBA / driver did not detect a link down, so the failing command was pushed onto the error handler, which of course eventually resorted to an HBA reset. But by that time the cluster had already kicked out the machine. And as all machines in the cluster were connected to the same switch, this happened to all of them, resulting in a nice cluster shutdown. And a really unhappy customer.

Looking closer, multipathing actually managed to detect the failure and switch paths as desired, but as the initial failing command had been pushed onto the error handler, all applications had to wait for this command to finish before proceeding.

So the following questions:
- Why did the FC HBA not detect a 'link-down' scenario?
  (Incidentally, this happens with QLogic _and_ Emulex :-)
  I know this is not a typical link-down, but I would naively
  assume that the HBA should detect that commands are not
  making progress, and at least once R_A_TOV has expired
  it should try to reset the link.
- Can we speed up error handling for these cases?
  Currently we're waiting for eh to complete before returning
  the affected commands with a final state.
  However, after we've done a LUN reset there shouldn't be
  any command state left and we should be able to terminate
  outstanding commands directly, without having to wait for
  eh to finally complete. James?
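
To make the first question a bit more concrete, here is a rough sketch of the kind of "no progress" check I have in mind. It is only a toy model in Python, not driver code: the StallWatchdog class, its method names and the 10 second value used for R_A_TOV are all assumptions made up for illustration, and a real HBA driver would take the timeout from the fabric parameters.

```python
from dataclasses import dataclass, field

# Assumed value for illustration only; the real R_A_TOV comes from the fabric.
R_A_TOV_SECONDS = 10.0


@dataclass
class StallWatchdog:
    """Toy model: flag a link as stalled when commands are in flight
    but none of them has completed for `timeout` seconds."""
    timeout: float = R_A_TOV_SECONDS
    outstanding: set = field(default_factory=set)
    last_progress: float = 0.0

    def submit(self, tag, now):
        # An idle link starts its progress clock at the first submission.
        if not self.outstanding:
            self.last_progress = now
        self.outstanding.add(tag)

    def complete(self, tag, now):
        # Any completion counts as progress for the whole link.
        self.outstanding.discard(tag)
        self.last_progress = now

    def stalled(self, now):
        # Stalled: something is in flight and nothing came back in time.
        in_flight = bool(self.outstanding)
        return in_flight and (now - self.last_progress) >= self.timeout


wd = StallWatchdog()
wd.submit("cmd-1", now=0.0)
wd.submit("cmd-2", now=1.0)
wd.complete("cmd-1", now=2.0)   # last sign of life on the link
print(wd.stalled(now=5.0))      # False: only 3s without progress
print(wd.stalled(now=12.0))     # True: 10s without progress, cmd-2 still out
```

The point is that the check looks at progress on the link as a whole rather than at the timeout of a single command, so a 'drain' like the one above would be noticed even though the link state itself never changes.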

Thanks.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)