On Mon, 2008-01-14 at 22:03 +0100, Vojtech Pavlik wrote:
> On Mon, Jan 14, 2008 at 02:03:45PM -0600, James Bottomley wrote:
> >
> > On Mon, 2008-01-14 at 11:45 -0800, Darrick J. Wong wrote:
> > > On Mon, Jan 14, 2008 at 03:49:16PM +0100, Jan Sembera wrote:
> > > > Hi,
> > > >
> > > > we have array of 16 SAS disks connected to Adaptec controllers
> > > > ...
> > > > this elsewhere and I was recommended to send it to linux-scsi.
> > >
> > > Hmm... I think Peter Bogdanovic was hitting this error recently (cc'd).
> > > There are a lot of PRIMITIVE_RECVD messages in the log, which make me
> > > wonder if the expander is being flaky or something? The commands that
> > > start timing out under heavy load followed by the repeated broadcasts
> > > might be indicative of that, since the sequencer firmware and the kernel
> > > driver are up to date. Unfortunately, I don't have any LSI expanders...
> >
> > I do, and actually, I've seen behaviour like this, except on a SATAPI
> > DVD not a disk. What seems to happen is that the expander hangs up on
> > the device and I can't recover it except by power cycling the expander
> > (other devices on the expander continue to work normally).
>
> It'd be rather hard to power cycle the 16-drive backplane with dual
> LSISASx28 expanders in this server without bringing the rest of the
> system down.
>
> If the backplane was as flaky as you suggest, I doubt anyone could use
> these machines in production, even under other OSs ...

I'm merely telling you what I see in my LSI expanders. However, one of
the characteristics is that I can't get any response even to a hard
reset on the port (that's echo 1 > /sys/class/sas_phy/<phy>/hard_reset)
if it is the same problem.

> > The problem is (if it is the same problem) there isn't any defined error
> > recovery from this ... the standards don't contain an expander reset,
> > and the expander isn't responding to the phy reset (either hard or
> > soft). So I'm not sure what can be done at this point.
>
> In our last test run, we've received some more errors, but this time the
> system recovered and actually finished the test load:

It could just be a simple failure in the error handler then. libsas
implements its own, so I bet there are a few corner cases ...

James