On Wed, Mar 10, 2010 at 02:19:53PM +0530, Desai, Kashyap wrote:
> Bron,
>
> See https://bugzilla.novell.com/show_bug.cgi?id=554152
>
> From first look they might be similar issue or may be of same type of issue.

Yes, it does look very similar. Basically it's a timeout error on the drive (or block device as exported by an external unit, in our case). It's frustrating that the driver can't abort just the one block device's IO and reset that device, rather than having to re-negotiate with every device.

> Some information I have shared with you which I have already received for novell bugzilla.
>
> Can you do one experiment?
> Immediate after this issue do HBA reset using sg_reset and please share result of the experiment.

I can't do that, because I don't usually know about the error immediately. I could probably have something check dmesg every second and run sg_reset when it found a new one, but that's pretty complex, and these are production boxes as well. Is there much point? It's pretty clear from the debug logs that the driver is doing at least a bus reset each time, and I'm guessing a host reset when it does the full re-negotiation of line speeds.

> Which OS version and driver version this issue has been introduce?
> I am interested to know is there any regression from driver point of view?

We're running Linux kernel 2.6.32.9 with one very small patch to add some SIGKILL debugging (not necessary now, since we solved that issue, but I haven't pulled the patch from our production machines yet). The issue has certainly existed in earlier kernels; I'm pretty sure it goes back to the 2.6.27 kernels, which we're still running on some machines that haven't been restarted for a while. However, it's looking like a firmware upgrade on the drives inside the unit has solved the issue.

> > My question (if you've read this far! Thanks) is what's triggering
> > this complete flush of ALL IO to the device, and also triggering a
> > complete SCSI renegotiation the first time it happens (lost in the
> > mists of dmesg now) - not only on this device but on the other unit
> > which is attached to the other channel on the card.

Is there a reason that a timeout on one device causes a bus or host reset? It seems that if the bus reset doesn't unstick it, the host reset is going to be pointless.

The underlying problem seems to be chained timeouts not adding up; in particular, the RAID controller and disk timeouts weren't matching. I'm still harassing the vendor about why this isn't causing anything to be logged in their debug logs. There's no evidence on the RAID unit that anything has gone wrong until it gets a bus reset; it does log that.

> > Is this likely to be related to the 127 outstanding items (queue full)
> > there? I notice sd 1:0:0:0: (sdb) has a full 64 items in the queue as
> > well, which is the maximum queue size according to the code.

Obviously not, because later debugging managed to catch a failure with only 6 outstanding IO requests. It looks like the external unit was still servicing IO requests to all the other RAIDsets just fine; it was the 30 second wait on one channel that caused the failure.

I also suspect that the way we've configured things (multiple virtual "volumes" on a single RAIDset) is likely to confuse the queues on the Linux end: if the RAIDset is timing out on one request, it will cause requests on all the volumes to back up.

Regards,

Bron.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
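For what it's worth, the "watch dmesg and fire sg_reset" idea mentioned above could be sketched roughly as below. This is a hypothetical sketch only, not a recommendation for production: the message pattern, the /dev/sg0 path, and the `--host` flag are all assumptions to be checked against the actual driver messages and the installed sg3_utils version.

```shell
#!/bin/sh
# Hypothetical watcher sketch: poll the kernel log once a second and fire
# an sg_reset at the HBA when a new timeout/abort message shows up.
# PATTERN and SG_DEV are assumptions; sg_reset is from sg3_utils.

SG_DEV=${SG_DEV:-/dev/sg0}
PATTERN=${PATTERN:-'attempting task abort'}

# Count matching lines in whatever log text arrives on stdin.
# (|| true keeps a zero-match grep from looking like a failure.)
count_timeouts() {
    grep -c "$PATTERN" || true
}

# Poll loop: compare successive counts, reset the host on any increase.
watch_and_reset() {
    last=$(dmesg | count_timeouts)
    while sleep 1; do
        now=$(dmesg | count_timeouts)
        if [ "$now" -gt "$last" ]; then
            echo "new SCSI timeout seen, resetting host via $SG_DEV" >&2
            sg_reset --host "$SG_DEV"
            last=$now
        fi
    done
}
```

As the thread notes, it's a lot of machinery for little gain if the driver is already doing bus and host resets itself.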
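On the "chained timeouts not adding up" point: the Linux side of that chain is tunable. The kernel's per-device SCSI command timeout is exposed in sysfs, and making it comfortably longer than the RAID unit's internal drive timeout gives the unit a chance to finish its own recovery before the kernel escalates to aborts and resets. A sketch, where the 60-second value and the sdb device name are assumptions to be matched against the unit's actual internal timeout:

```shell
# Show the current SCSI command timeout for sdb, in seconds (default 30).
cat /sys/block/sdb/device/timeout

# Raise it above the RAID unit's internal drive timeout so the unit can
# retry internally before the kernel starts aborting commands.
echo 60 > /sys/block/sdb/device/timeout
```

Note this only widens the Linux end of the chain; the mismatch between the RAID controller and its disks has to be fixed on the unit itself (which is presumably what the drive firmware upgrade did).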