On Mon, 2008-02-18 at 22:26 +0800, Keith Hopkins wrote:
> Well, that made life interesting....
> but didn't seem to fix anything.
>
> The behavior is about the same as before, but with more verbose
> errors. I failed one member of the raid and had it rebuild as a
> test...which hangs for a while and the drive falls off-line.
>
> Please grab the dmesg output in all its gory glory from here:
> http://wiki.hopnet.net/dokuwiki/lib/exe/fetch.php?media=myit:sas:dmesg-20080218-wpatch-fail.txt.gz

I had a look through this. Amazingly, in spite of the message spew,
up to here:

> sas: Enter sas_scsi_recover_host
> sas: trying to find task 0xffff81033c3d3d80
> sas: sas_scsi_find_task: aborting task 0xffff81033c3d3d80
> aic94xx: tmf timed out
> aic94xx: tmf came back

everything is going normally (the REQ_TASK_ABORTs are properly aborted
and retried). At this point (around L3449 in the trace) the aborts
start failing. Unfortunately, there's a bug in the driver's TMF timeout
handling: it leaves the sequencer entry pending but frees the ascb. If
the sequencer ever picks this entry up it will get very confused, as it
does a while further down in the trace:

> aic94xx: BUG:sequencer:dl:no ascb?!
> aic94xx: BUG:sequencer:dl:no ascb?!

That's where the sequencer adds an ascb to the done list that we've
already freed. From this point on confusion reigns and the error
handler eventually offlines the device.

I'll see if I can come up with patches to fix this ... or at least
mitigate the problems it causes.

James
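
For illustration only, here is a toy model in plain C of the lifetime
problem described above. It is a sketch under stated assumptions, not
driver code: none of these names (toy_hba, toy_ascb, toy_post,
toy_timeout_buggy, toy_timeout_fixed, toy_complete) exist in aic94xx or
libsas, and the "fixed" variant shows just one possible mitigation, not
the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_TAGS 8

/* Hypothetical result codes for a completion posted by the sequencer. */
enum toy_result {
    TOY_OK = 0,        /* completion matched a live control block      */
    TOY_STALE = -1,    /* nothing pending for this tag; safely ignored */
    TOY_NO_ASCB = -2   /* entry was pending but its block is gone      */
};

/* Hypothetical stand-in for a control block; not the real structure. */
struct toy_ascb {
    bool allocated;    /* host side still owns a valid block */
};

/* Toy model of the state shared between host driver and sequencer. */
struct toy_hba {
    struct toy_ascb ascb[TOY_MAX_TAGS];
    bool seq_pending[TOY_MAX_TAGS];  /* sequencer may still complete tag */
};

/* Host posts a request: allocate the block, hand the tag to the sequencer. */
static int toy_post(struct toy_hba *hba, int tag)
{
    assert(tag >= 0 && tag < TOY_MAX_TAGS);
    if (hba->ascb[tag].allocated)
        return -1;                   /* tag already in use */
    hba->ascb[tag].allocated = true;
    hba->seq_pending[tag] = true;
    return 0;
}

/* Buggy timeout path: frees the block but leaves the entry pending. */
static void toy_timeout_buggy(struct toy_hba *hba, int tag)
{
    hba->ascb[tag].allocated = false;
}

/* One possible mitigation: retire the sequencer entry before freeing. */
static void toy_timeout_fixed(struct toy_hba *hba, int tag)
{
    hba->seq_pending[tag] = false;
    hba->ascb[tag].allocated = false;
}

/* Sequencer puts a completion for this tag on the done list. */
static enum toy_result toy_complete(struct toy_hba *hba, int tag)
{
    if (!hba->seq_pending[tag])
        return TOY_STALE;
    hba->seq_pending[tag] = false;
    if (!hba->ascb[tag].allocated)
        return TOY_NO_ASCB;          /* models the "no ascb?!" case */
    hba->ascb[tag].allocated = false;
    return TOY_OK;
}
```

In this model, a late completion after toy_timeout_buggy() returns
TOY_NO_ASCB, while after toy_timeout_fixed() it is reported as
TOY_STALE and can be dropped. Real hardware cannot retire a sequencer
entry by clearing a flag, so an actual fix would need to either cancel
the entry in the sequencer or defer freeing the block until the
sequencer has returned it.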