Hello,

On Fri, Sep 02, 2011 at 05:22:38PM +0100, Bruce Stenning wrote:
> Unfortunately it has so far been quite difficult to reproduce when specifically
> attempting to. In normal use cases I reproduced it twice by unplugging a drive
> from a RAID array with redundancy intact. This was out of around a dozen
> cycles of waiting until redundancy was restored while the unit was under load,
> popping the disk, reinserting, and triggering a RAID rebuild.

Hmm... that's unfortunate.

> I have only twice managed to trigger a lockup deliberately. In both cases the
> tracing showed a scheduled EH which was subsequently not enacted.
>
> How long could it take for the EH to be enacted? In the lockups that I
> have reproduced it did not seem to have recovered minutes later, but perhaps
> if I had waited longer...? I have noticed that error recovery sometimes backs
> off for 8s and even 33s, but it always warns when that sort of delay is coming
> up.

It should happen pretty quickly. In such cases, fastdrain is activated
and all pending commands are aborted if they don't complete within 3 secs,
and then EH should kick in. The backoff comes from the reset path only, to
give breathing time to devices which take a long time to spin up, and
doesn't apply in this case.

> I shall continue to try to track down why the scheduled EH does not happen.

Can you please add some debug printk's to scsi_schedule_eh() and see
whether scsi_eh_wakeup() is invoked from there? It seems likely that
the problem is caused by race conditions around the SHOST_[CANCEL_]RECOVERY
flags.

Thanks.

--
tejun
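
To make the scsi_schedule_eh() / scsi_eh_wakeup() question above concrete,
here is a rough userspace sketch of that path. It is not kernel code. Every
name prefixed with model_ or M_ is hypothetical, printf() stands in for the
debug printk's, and the state-transition rules and the busy == failed
wakeup condition are simplified assumptions about kernels of this era that
should be checked against the actual tree. It shows two places where a
scheduled EH could fail to wake the handler thread: the host state change
is refused, or the wakeup check finds commands still in flight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the SHOST_* host states (assumption). */
enum model_state {
	M_RUNNING, M_CANCEL, M_RECOVERY, M_CANCEL_RECOVERY, M_DEL
};

/* Hypothetical miniature of the host fields this path touches. */
struct model_host {
	enum model_state state;
	unsigned int host_busy;         /* commands in flight */
	unsigned int host_failed;       /* commands that have failed */
	unsigned int host_eh_scheduled; /* EH requests recorded */
	unsigned int wakeups;           /* times the EH thread was woken */
};

/* Returns 0 if the transition is allowed, -1 if it is refused. */
int model_set_state(struct model_host *h, enum model_state new_state)
{
	enum model_state old = h->state;
	bool ok = false;

	if (old == new_state)
		return 0;

	switch (new_state) {
	case M_RECOVERY:
		ok = (old == M_RUNNING);
		break;
	case M_CANCEL_RECOVERY:
		ok = (old == M_CANCEL || old == M_RECOVERY);
		break;
	default:
		break;
	}
	if (!ok)
		return -1;

	h->state = new_state;
	return 0;
}

/* Wakes the (imaginary) EH thread only when every busy command failed. */
bool model_eh_wakeup(struct model_host *h)
{
	bool wake = (h->host_busy == h->host_failed);

	printf("eh_wakeup: busy=%u failed=%u wake=%d\n",
	       h->host_busy, h->host_failed, (int)wake);
	if (wake)
		h->wakeups++;
	return wake;
}

/* Returns true only if the EH thread was actually woken. */
bool model_schedule_eh(struct model_host *h)
{
	bool woke = false;

	assert(h != NULL);
	printf("schedule_eh: enter state=%d\n", (int)h->state);

	if (model_set_state(h, M_RECOVERY) == 0 ||
	    model_set_state(h, M_CANCEL_RECOVERY) == 0) {
		h->host_eh_scheduled++;
		woke = model_eh_wakeup(h);
	} else {
		printf("schedule_eh: state change refused, no wakeup\n");
	}

	printf("schedule_eh: exit state=%d scheduled=%u\n",
	       (int)h->state, h->host_eh_scheduled);
	return woke;
}
```

Calling model_schedule_eh() on a host in M_RUNNING with host_busy = 2 and
host_failed = 1 records the EH request but wakes nothing, which is the kind
of "scheduled but not enacted" outcome the debug output is meant to expose.
In the real functions the equivalent messages would go at the same three
points: on entry, on the refused-state branch, and inside the wakeup check.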