On 4/21/05, Lars Marowsky-Bree <lmb@xxxxxxx> wrote:
> On 2005-04-21T23:33:57, Andreas Herrmann <aherrman@xxxxxxxxxx> wrote:
>
> > Well, there are various situations when all paths to the ESS are
> > "temporarily unavailable". In some cases TASK_SET_FULL/BUSY is
> > reported as it should be.
>
> Not sure whether this sense data is decoded and handled correctly in
> dm-mpath yet. I don't have detailed specs, nor a feature request to
> allocate time to work on making sure it really does. I recommend that
> someone at IBM takes the real specs for the ESS and makes sure that it
> all works, by a combination of the right defaults in the multipath-tools
> hwtable and, if need be, a dm-ess plugin to handle this.
>
> This would be much appreciated.
>

Please correct me if my assumption is wrong, but I would think that
transient errors are expected, especially in a SAN, from both the fabric
and the media. A storage device may have to return retryable status
conditions at certain points, and such retryable conditions are not
necessarily specific to one storage device. For example, QUEUE_FULL or
BUSY implies that the device is congested. Wouldn't most storage devices
reasonably expect that I/O failed due to this condition will be retried?
[Such a congestion handling mechanism, I would think, would not have to
be storage-specific, although the policy for handling congestion might
be.]

So in order to deal with transient conditions when the failfast flag is
set, queue_if_no_path must be used; I'm not sure why any dm-multipath
user would not want queue_if_no_path turned on by default. As far as I
know, ESS does not require any special handling of special sense
information, besides the various sense data status conditions that it
expects to be retried. (Aren't data underruns also an expected retryable
condition?)
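
To illustrate the generic handling described above, here is a minimal
sketch (Python, for readability only) of treating BUSY and TASK SET FULL
(QUEUE_FULL) as congestion and retrying the I/O. The status byte values
are the standard SCSI ones; is_transient, submit_with_retry and the
attempt limit are hypothetical names and do not correspond to dm-multipath
or SCSI core code.

```python
# Standard SCSI status byte values.
GOOD = 0x00
CHECK_CONDITION = 0x02
BUSY = 0x08
TASK_SET_FULL = 0x28  # also known as QUEUE_FULL

# Assumption: these two statuses are treated as device congestion and
# are therefore retryable regardless of the storage device model.
RETRYABLE_STATUSES = {BUSY, TASK_SET_FULL}


def is_transient(status):
    """Return True if the status indicates a retryable congestion condition."""
    return status in RETRYABLE_STATUSES


def submit_with_retry(send, max_attempts=5):
    """Call send() until it returns a non-transient status.

    send is a hypothetical callable that issues the I/O and returns a
    SCSI status byte. Returns (last_status, attempts_used). max_attempts
    must be at least 1.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = None
    for attempt in range(1, max_attempts + 1):
        status = send()
        if not is_transient(status):
            return status, attempt
    # Still congested after the last attempt: report the transient status.
    return status, max_attempts
```

How many attempts to allow, and whether to back off between them, is the
part that might reasonably be a per-device policy.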
I'm not so familiar with all the various possible transport and media
errors/conditions, but I would think that most could be handled
generically across storage devices (which is why the SCSI core has
generic error handling, I'd imagine). But I agree that more testing
should be done with ESS and its spec to verify that a special dm-ess
error handler is actually not needed. And at the least, a hw entry should
be added to dm to turn on queue_if_no_path by default for ESS, along with
any other necessary defaults.

Although, it seems we need to add to multipath-tools the ability to set a
timeout limit on how long an I/O is queued and retried (otherwise, in a
permanent failure, I think the I/O could be queued for quite a while,
e.g. until the system runs out of memory).

Also, what do you think about allowing a configurable threshold on I/O
failures in dm-multipath before deciding to set a path dead? A threshold
of 1 is kind of low, and has no tolerance at all for transient errors. I
think it would lessen the dependency on waiting for multipath-tools to
reinstate a path that has been set dead due to a transient condition.

Thanks!
Lan
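
P.S. A rough sketch of the two suggestions above (a failure threshold
before a path is set dead, and a limit on how long I/O stays queued),
written in Python purely for illustration. Nothing here is actual
dm-multipath or multipath-tools code; PathState, fail_threshold,
queue_timeout and should_fail_queued_io are made-up names, and counting
consecutive failures is only one possible policy.

```python
class PathState:
    """Tracks consecutive I/O failures on one path (hypothetical model)."""

    def __init__(self, fail_threshold=3):
        # Assumption: the threshold counts consecutive failures; a
        # threshold of 1 reproduces the "dead on first error" behavior.
        self.fail_threshold = fail_threshold
        self.consecutive_failures = 0
        self.dead = False

    def record(self, ok):
        """Record one I/O result; return True if the path is now dead."""
        if ok:
            # A success clears the count, so isolated transient errors
            # never accumulate up to the threshold.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.fail_threshold:
                self.dead = True
        return self.dead

    def reinstate(self):
        """Bring a dead path back, as a path checker would."""
        self.consecutive_failures = 0
        self.dead = False


def should_fail_queued_io(queued_at, now, queue_timeout):
    """Return True once an I/O has been queued for queue_timeout seconds.

    All arguments are in seconds; queued_at and now come from the same
    clock. After this point the I/O would be failed up to the caller
    instead of being held in the queue indefinitely.
    """
    return (now - queued_at) >= queue_timeout
```

With fail_threshold=3, a single BUSY followed by a successful I/O leaves
the path usable, while three failures in a row still mark it dead.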