On Friday 14 December 2007 15:35:01 James Bottomley wrote:
> > > This is some type of ioc internal error.  What we do on DID_SOFT_ERROR
> > > is retry for the usual number of times up to the timeout limit.
> > > Unfortunately, the retries are fixed at SD_MAX_RETRIES in sd.c.
> > > Without diagnosing what's going wrong in the fusion, it's impossible to
> > > say if this is reasonable, but your fusion is signalling ioc errors
> > > (firmware errors).
> >
> > Although this seems to be a fusion driver or firmware problem, I still
> > think eh is not activated for this error. I'm not absolutely sure, but I
> > think with my patch deh and later on eh would be triggered, wouldn't it?
>
> The full eh machinery, by design, isn't activated for a simple retry.
> If you look in scsi_lib.c:scsi_softirq_done() you'll see the processing
> of the outcome of scsi_decide_disposition() (DID_SOFT_ERROR comes out of
> here with NEEDS_RETRY, providing there are retries left).  Right at the
> moment, this means that the retry is absolutely immediate, so you
> probably run through all of the retries before firmware recovery even
> has time to activate.  I'd be amenable to giving it an ADD_TO_MLQUEUE
> type return (provided it still increments retries) which will cause a
> pause in the resubmission (until either a command returns or io pressure
> builds up in the block layer).

Isn't there always i/o pressure if the scsi bus is saturated?
Can we activate the eh machinery when the retries are exhausted?

Index: linux-2.6.22/drivers/scsi/scsi_error.c
===================================================================
--- linux-2.6.22.orig/drivers/scsi/scsi_error.c	2007-12-14 15:53:48.000000000 +0100
+++ linux-2.6.22/drivers/scsi/scsi_error.c	2007-12-14 15:58:27.000000000 +0100
@@ -1235,7 +1235,7 @@ int scsi_decide_disposition(struct scsi_
 		 * and not get stuck in a loop.
 		 */
 	case DID_SOFT_ERROR:
-		goto maybe_retry;
+		goto maybe_requeue;
 
 	case DID_IMM_RETRY:
 		return NEEDS_RETRY;
@@ -1342,6 +1342,24 @@ int scsi_decide_disposition(struct scsi_
 		 */
 		return SUCCESS;
 	}
+
+      maybe_requeue:
+
+	/* we requeue for retry because the error was retryable, and
+	 * the request was not marked fast fail.  Note that above,
+	 * even if the request is marked fast fail, we still requeue
+	 * for queue congestion conditions (QUEUE_FULL or BUSY) */
+	if ((++scmd->retries) <= scmd->allowed
+	    && !blk_noretry_request(scmd->request)) {
+		return ADD_TO_MLQUEUE;
+	} else {
+		/*
+		 * no more retries - report this one back to upper level.
+		 *
+		 * TODO: initiate full error recovery now?
+		 */
+		return SUCCESS;
+	}
 }
 
 /**

> > > > > Full log attached.
> > > >
> > > > > immediate error with no eh intervention because it means that the
> > > > > target went away.  Handling this as a retryable error isn't an
> > > > > option because it will interfere with hotplug.
> > > >
> > > > Then we need a sysfs flag one can set to manually enable eh for these
> > > > devices on DID_NO_CONNECT.
> > >
> > > No, because that will seriously damage a lot of other systems.
> >
> > How would it, if we create a device specific sysfs parameter defaulting
> > to off? If you think users could activate it by accident, we could also
> > print a big warning when the parameter is read from userspace.
> > Furthermore, as far as I understood you, DID_NO_CONNECT is only
> > required for hotplugging. But real scsi doesn't do automatic hotplugging,
> > does it?
>
> Yes, it does.  Most modern busses are hot plug aware and use
> DID_NO_CONNECT to signal that the target went away.  Even some SPI frames
> are quasi hotplug aware.
>
> > One
> > always needs to do it manually, e.g. with scsiadd or similar tools. So is
> > DID_NO_CONNECT really required for native scsi? If not, we also could
> > make the scsi drivers set a flag to activate eh on DID_NO_CONNECT.
>
> Just grep through the mid layer ... you'll see we use DID_NO_CONNECT on
> a host of other error conditions to force an immediate error as well.

I will do that later on. I will also write a patch allowing error recovery
for manually overridden devices.

[...]

> > > This looks like a genuine bug.  I missed the thread, since my email
> > > system went off line while I was on holiday for two weeks.  The
> > > symptoms look to be lost commands, but I can't see why from the traces.
> > > There's a known bug where we can hang in domain validation because of
> > > a resource starvation issue, but I know of none where everything hangs
> > > just after error recovery completes.
> >
> > Since not much has happened yet to solve this bug, shall I create a
> > bugzilla entry?
>
> Sure ... on further analysis, it is the fusion DV resource starvation
> issue.  The email thread is here:
>
> http://marc.info/?t=118039577800004

Interesting thread. I don't understand the details yet, but I'm really
curious whether this can somehow also explain the *almost deadlock* we are
seeing when we do md-resync at maximum device speed.

Thanks a lot for your help,
Bernd

-- 
Bernd Schubert
Q-Leap Networks GmbH
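
For readers following the thread, here is a minimal userspace sketch of the
retry accounting discussed above. It is not kernel code: the enum, the struct
and all function names are hypothetical stand-ins, the fast_fail field only
models what blk_noretry_request() reports, and the retry budget of 5 in the
usage note is just an illustrative value. It assumes the pre-patch path uses
the same counter check as the patched one and differs only in the value it
returns.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the mid-layer dispositions (not kernel values). */
enum disposition { DISP_SUCCESS, DISP_NEEDS_RETRY, DISP_ADD_TO_MLQUEUE };

/* Hypothetical model of the fields of a command that the check looks at. */
struct model_cmd {
	int retries;    /* retries consumed so far */
	int allowed;    /* retry budget for this command */
	bool fast_fail; /* models blk_noretry_request() being true */
};

/* Assumed pre-patch behaviour: retry immediately while budget remains. */
static enum disposition soft_error_immediate(struct model_cmd *cmd)
{
	if (++cmd->retries <= cmd->allowed && !cmd->fast_fail)
		return DISP_NEEDS_RETRY;
	return DISP_SUCCESS; /* budget exhausted: report to upper level */
}

/* Patched behaviour: same accounting, but ask for a requeue instead. */
static enum disposition soft_error_requeue(struct model_cmd *cmd)
{
	if (++cmd->retries <= cmd->allowed && !cmd->fast_fail)
		return DISP_ADD_TO_MLQUEUE;
	return DISP_SUCCESS;
}

/* Count how often a command is resubmitted before the model gives up. */
static int count_resubmissions(struct model_cmd *cmd,
			       enum disposition (*decide)(struct model_cmd *))
{
	int n = 0;

	while (decide(cmd) != DISP_SUCCESS)
		n++;
	return n;
}
```

With a budget of 5, both variants resubmit the command five times and then
return the model's DISP_SUCCESS, so the number of attempts is unchanged. The
only difference is which disposition is returned for each attempt, which is
what decides whether the resubmission is immediate or paused. Neither variant
enters error handling once the budget is used up, which is the open TODO in
the patch.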