On 17 August 2011 10:08, Peter Chang <dpf@xxxxxxxxxx> wrote:
> On 17 August 2011 07:25, Fredrik Lindgren <fli@xxxxxxxx> wrote:
>> When doing disk IO on the disks (they are all configured in MD raids)
>> suddenly IO will stop and these messages are printed on the console
>> about once every second:
>>
>> mpt2sas0: log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)
>>
>> From what I understand this means:
>>
>> PL_LOGINFO_CODE_RESET (0x00110000)
>> PL_LOGINFO_SUB_CODE_SATA_NON_NCQ_RW_ERR_BIT_SET (0x00000600)
>>
>> So a disk is acting up, generating errors? What does the last "10" mean
>> in the sub_code, is that an identifier for which disk it is?
>
> no, the bottom bits are still part of the error code.
>
> i haven't run w/ your exact fw/driver setup, but i think you'll find
> that you're in a 'loop' where the driver is returning DID_RESET and
> the scsi layer is retrying w/o going through the retry counter logic
> (the command that fails is one that the firmware issued).

since someone else gave the error code (i didn't check if i just had
some other magic header)... the problem is probably a combination of
the disk and controller firmwares. when an NCQ request fails the
firmware will do a READ LOG EXT(10) to figure out why. some disks
don't handle this sequence the way the firmware expects, so the
firmware starts the COMRESET dance w/ the disk and returns an event
w/ the loginfo to the driver/kernel.

the 'fix' (really a workaround) is in
mpt2sas_scsih.c:_scsih_io_done(). in the case for
MPI2_IOCSTATUS_SCSI_TASK_TERMINATED change the DID_RESET to
DID_SOFT_ERROR and the rest of the scsi layer will go down the regular
retry handling and you'll get out of the 'loop'. lsi is supposed to
have this fix coming soon.

disabling NCQ will 'fix' this as well.

\p
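
for anyone wanting to pick apart a log_info word by hand, here is a rough sketch in python. the field layout (top nibble bus type, next nibble originator, then an 8-bit code and a 16-bit sub_code) is an assumption inferred from the console line quoted above, and the originator names other than PL are likewise assumed; decode_log_info and format_log_info are made-up helper names, not driver functions.

```python
# Assumed layout of the 32-bit log_info word, inferred from the console line
# "log_info(0x31110610): originator(PL), code(0x11), sub_code(0x0610)":
#   bits 31-28 bus type, bits 27-24 originator, bits 23-16 code, bits 15-0 sub_code
ORIGINATORS = {0: "IOP", 1: "PL", 2: "IR"}  # assumption: only PL is confirmed by the thread


def decode_log_info(value):
    """Split a log_info word into its (assumed) fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,  # all 16 bits belong to the error code
    }


def format_log_info(value):
    """Render the fields in the same shape as the console message."""
    fields = decode_log_info(value)
    name = ORIGINATORS.get(fields["originator"], "unknown")
    return "log_info(0x%08x): originator(%s), code(0x%02x), sub_code(0x%04x)" % (
        value, name, fields["code"], fields["sub_code"])


if __name__ == "__main__":
    print(format_log_info(0x31110610))
```

the point of keeping sub_code as a full 16-bit field is the one made above: the trailing "10" in 0x0610 is not a disk index, it is just the low bits of the sub code.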