Hi,

we have an array of 16 SAS disks connected to Adaptec controllers (an ASC-58300 and an on-board AIC-9410W; the bug occurs on both of them). The controller initializes the drives successfully and they seem to work normally. However, when we run an I/O-intensive task (such as creating an md raid5 over just a few disks and then reshaping it onto all of the disks), everything seems to work for a while, and then the system starts emitting the log messages included later in this mail and kicks out some (or all) of the disk devices. Sometimes it detects them again, sometimes it doesn't. Occasionally it even leads to a complete lockup of the machine in question.

I apologize in advance if I overlooked something obvious, but I was unable to find any reference to this elsewhere, and I was advised to send it to linux-scsi.

As the logs are quite big, I include only part of them here. The full logs are available at the following URLs:

http://init.suse.cz/sas-error-s1 (Adaptec AIC-9410W SAS)
http://init.suse.cz/sas-error-s2 (Adaptec ASC-58300 SAS)

== CUT ==
...
sas: command 0xffff810260004b00, task 0xffff810266a55780, timed out: EH_NOT_HANDLED
sas: command 0xffff810291fda540, task 0xffff81030d7b57c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81025368b780, task 0xffff8102e58fccc0, timed out: EH_NOT_HANDLED
sas: command 0xffff81029eb536c0, task 0xffff81025d755300, timed out: EH_NOT_HANDLED
sas: command 0xffff8102fd8a6b00, task 0xffff8102c245c6c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810132015300, task 0xffff81025d755a00, timed out: EH_NOT_HANDLED
sas: command 0xffff8100134a29c0, task 0xffff8102c6c6f400, timed out: EH_NOT_HANDLED
sas: command 0xffff810140d9b100, task 0xffff8101ac3b0b80, timed out: EH_NOT_HANDLED
sas: command 0xffff81020c4f50c0, task 0xffff81001e99a080, timed out: EH_NOT_HANDLED
sas: command 0xffff8101f4451680, task 0xffff81012f8dbd40, timed out: EH_NOT_HANDLED
sas: command 0xffff810050910140, task 0xffff81026361f4c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810266a57600, task 0xffff81014e0c92c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810140d9b2c0, task 0xffff81015b2162c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81025368b940, task 0xffff81030ce5da00, timed out: EH_NOT_HANDLED
sas: command 0xffff8102fc3d1a80, task 0xffff8101d41947c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81020c4f5600, task 0xffff810299e789c0, timed out: EH_NOT_HANDLED
sas: Enter sas_scsi_recover_host
sas: trying to find task 0xffff810266a55780
sas: sas_scsi_find_task: aborting task 0xffff810266a55780
aic94xx: tmf timed out
aic94xx: tmf came back
aic94xx: task not done, clearing nexus
aic94xx: asd_clear_nexus_index: PRE
aic94xx: asd_clear_nexus_index: POST
aic94xx: asd_clear_nexus_index: clear nexus posted, waiting...
aic94xx: asd_clear_nexus_timedout: here
aic94xx: came back from clear nexus
aic94xx: task not done, clearing nexus
aic94xx: asd_clear_nexus_index: PRE
aic94xx: asd_clear_nexus_index: POST
aic94xx: asd_clear_nexus_index: clear nexus posted, waiting...
aic94xx: asd_clear_nexus_tasklet_complete: here
aic94xx: asd_clear_nexus_tasklet_complete: opcode: 0x0
aic94xx: came back from clear nexus
aic94xx: task 0xffff810266a55780 aborted, res: 0x5
sas: sas_scsi_find_task: querying task 0xffff810266a55780
aic94xx: tmf tasklet complete
sas: sas_scsi_find_task: task 0xffff810266a55780 not at LU
sas: task 0xffff810266a55780 is not at LU: I_T recover
sas: I_T nexus reset for dev 5000c5000647e471
sas: clearing nexus for port:0
aic94xx: asd_clear_nexus_port: PRE
aic94xx: asd_clear_nexus_port: POST
aic94xx: asd_clear_nexus_port: clear nexus posted, waiting...
aic94xx: task 0xffff8101ac3b0b80 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81001e99a080 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81012f8dbd40 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81026361f4c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81014e0c92c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81015b2162c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81030ce5da00 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff8101d41947c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff810299e789c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: asd_clear_nexus_tasklet_complete: here
aic94xx: asd_clear_nexus_tasklet_complete: opcode: 0x23
aic94xx: task 0xffff8102e58fccc0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81025d755300 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff81030d7b57c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: BUG:sequencer:dl:no ascb?!
aic94xx: task 0xffff81025d755a00 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff8102c245c6c0 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: task 0xffff8102c6c6f400 done with opcode 0x23 resp 0x0 stat 0x8d but aborted by upper layer!
sas: clear nexus ha
aic94xx: asd_clear_nexus_ha: PRE
aic94xx: asd_clear_nexus_ha: POST
aic94xx: asd_clear_nexus_ha: clear nexus posted, waiting...
aic94xx: asd_clear_nexus_tasklet_complete: here
aic94xx: asd_clear_nexus_tasklet_complete: opcode: 0x0
sas: clear nexus ha succeeded
aic94xx: tmf timed out
aic94xx: escb_tasklet_complete: phy3: PRIMITIVE_RECVD
aic94xx: phy3: BROADCAST change received:256
aic94xx: escb_tasklet_complete: phy2: PRIMITIVE_RECVD
aic94xx: phy2: BROADCAST change received:256
aic94xx: escb_tasklet_complete: phy0: PRIMITIVE_RECVD
aic94xx: phy0: BROADCAST change received:256
sas: broadcast received: 9
sas: broadcast received: 9
sas: broadcast received: 9
sas: REVALIDATING DOMAIN on port 0, pid:2343
sas: ex 500304800001c47f phy18 originated BROADCAST(CHANGE)
sd 3:0:10:0: [sdn] Synchronizing SCSI cache
aic94xx: escb_tasklet_complete: phy3: PRIMITIVE_RECVD
aic94xx: phy3: BROADCAST change received:256
aic94xx: escb_tasklet_complete: phy2: PRIMITIVE_RECVD
aic94xx: phy2: BROADCAST change received:256
aic94xx: escb_tasklet_complete: phy0: PRIMITIVE_RECVD
aic94xx: phy0: BROADCAST change received:256
aic94xx: tmf timed out
...
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sd 3:0:10:0: scsi: Device offlined - not ready after error recovery
sas: --- Exit sas_scsi_recover_host
sas: command 0xffff8101f4451840, task 0xffff81026361fbc0, timed out: EH_NOT_HANDLED
sas: command 0xffff81020c4f5600, task 0xffff810299e78480, timed out: EH_NOT_HANDLED
sas: command 0xffff8102fc3d1a80, task 0xffff810299e782c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81025368b940, task 0xffff81014e0c9800, timed out: EH_NOT_HANDLED
sas: command 0xffff810140d9b2c0, task 0xffff81014e0c9d40, timed out: EH_NOT_HANDLED
sas: command 0xffff810266a57600, task 0xffff81014e0c99c0, timed out: EH_NOT_HANDLED
sas: command 0xffff810050910140, task 0xffff81014e0c9100, timed out: EH_NOT_HANDLED
sas: command 0xffff8101f4451680, task 0xffff81014e0c9b80, timed out: EH_NOT_HANDLED
sas: command 0xffff81020c4f50c0, task 0xffff81014e0c9480, timed out: EH_NOT_HANDLED
sas: command 0xffff810140d9b100, task 0xffff81014e0c9640, timed out: EH_NOT_HANDLED
sas: command 0xffff8100134a29c0, task 0xffff8102e58fc780, timed out: EH_NOT_HANDLED
sas: command 0xffff810132015300, task 0xffff8102e58fc080, timed out: EH_NOT_HANDLED
sas: command 0xffff8102fd8a6b00, task 0xffff8102e58fc940, timed out: EH_NOT_HANDLED
sas: command 0xffff810291fda540, task 0xffff8102e58fc5c0, timed out: EH_NOT_HANDLED
sas: command 0xffff81029eb536c0, task 0xffff8102e58fc240, timed out: EH_NOT_HANDLED
sas: command 0xffff81025368b780, task 0xffff8101ac3b0d40, timed out: EH_NOT_HANDLED
sas: command 0xffff810260004b00, task 0xffff8101ac3b09c0, timed out: EH_NOT_HANDLED
sas: Enter sas_scsi_recover_host
sas: trying to find task 0xffff81026361fbc0
sas: sas_scsi_find_task: aborting task 0xffff81026361fbc0
aic94xx: task 0xffff81026361fbc0 done with opcode 0x1e resp 0x0 stat 0x8d but aborted by upper layer!
aic94xx: tmf tasklet complete
aic94xx: tmf came back
aic94xx: asd_abort_task: task 0xffff81026361fbc0 done
aic94xx: task 0xffff81026361fbc0 aborted, res: 0x0
sas: sas_scsi_find_task: task 0xffff81026361fbc0 is done
sas: sas_eh_handle_sas_errors: task 0xffff81026361fbc0 is done
...
== CUT ==

If you need any further information or testing, I will be glad to provide it. I tried several different kernel versions (including a 2.6.24-rc6 git snapshot), with no change in behavior.

Thanks for your reply

--
Jan Sembera
Linux Administrator
---------------------------------------------------------------------
SUSE LINUX, s. r. o.          e-mail: jsembera@xxxxxxx
Lihovarská 1060/12            tel: +420 284 028 981
190 00 Praha 9                fax: +420 284 028 951
Czech Republic                http://www.suse.cz/