On 12/05/2012 03:20 AM, James Bottomley wrote:
> On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
>> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
>> command specifying the default self-test in addition to the background
>> short self-test. This seems a bit risky and excessive to me, but
>> apparently the guy that wrote it is no longer with the company.
>
> This is a really bad idea. A lot of disks go out to lunch until the
> diagnostics complete (the same goes for SMART diagnostics). This means
> that if you do diagnostics on a running device, the drivers start to get
> timeouts on commands which are queued waiting for diagnostics to
> complete ... if those go over the standard SCSI timeouts, we'll start to
> try error recovery and likely have the disaster you see above.
So it turns out that our problems are intermittently triggered when
running the default self-test. This agrees with the statement in
sg_senddiag not to do foreground self-tests on disks with mounted
filesystems.
We seem to be able to do background short self-tests (i.e., a SEND
DIAGNOSTIC command with a SELF-TEST CODE field of 001b and the SELFTEST
bit set to 0b) without causing any problems. Is this pushing our luck, or
is this something that should work according to the spec and the Linux
stack?
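For reference, here is a minimal sketch of how the two CDBs differ. It
assumes the SPC layout of byte 1 (SELF-TEST CODE in bits 7-5, SELFTEST in
bit 2); the helper name and its arguments are purely illustrative, and it
only builds the bytes rather than sending anything to a device.

```python
# Sketch only: build 6-byte SEND DIAGNOSTIC CDBs (opcode 0x1D).
# Assumed byte 1 layout (from SPC): SELF-TEST CODE in bits 7-5, PF in bit 4,
# SELFTEST in bit 2, DEVOFFL in bit 1, UNITOFFL in bit 0.

SEND_DIAGNOSTIC = 0x1D


def send_diag_cdb(self_test_code=0, selftest=False):
    """Return a 6-byte CDB. Illustrative helper, not part of any library."""
    if not 0 <= self_test_code <= 7:
        raise ValueError("SELF-TEST CODE is a 3-bit field")
    if selftest and self_test_code != 0:
        # The spec requires SELF-TEST CODE 000b when SELFTEST is set.
        raise ValueError("SELFTEST=1 requires SELF-TEST CODE 000b")
    byte1 = (self_test_code << 5) | (0x04 if selftest else 0x00)
    # Bytes 2-4 (reserved, parameter list length) and control byte left at 0.
    return bytes([SEND_DIAGNOSTIC, byte1, 0, 0, 0, 0])


# Default self-test: the one that intermittently triggers our problems.
default_cdb = send_diag_cdb(selftest=True)
# Background short self-test: SELF-TEST CODE 001b, SELFTEST 0b.
background_short_cdb = send_diag_cdb(self_test_code=0b001)

print(default_cdb.hex(" "))           # 1d 04 00 00 00 00
print(background_short_cdb.hex(" "))  # 1d 20 00 00 00 00
```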
The SCSI spec indicates that in this case, for most commands, the test
will be paused and the command executed within 2 seconds, but I don't
know what the normal SCSI timeouts are.
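One way to check the timeout actually in effect, as a rough sketch: on
Linux the per-device command timeout is, as far as I know, exposed in
seconds under /sys/block/<dev>/device/timeout (I believe the usual default
for disks is 30). The function name and the "sda" device name below are
only placeholders.

```python
from pathlib import Path


def scsi_timeout_seconds(disk, sysfs_root="/sys/block"):
    """Read the SCSI command timeout (seconds) for a disk from sysfs.

    Assumes the attribute lives at <sysfs_root>/<disk>/device/timeout;
    the function name itself is illustrative.
    """
    path = Path(sysfs_root) / disk / "device" / "timeout"
    return int(path.read_text().strip())


if __name__ == "__main__":
    try:
        # "sda" is a placeholder; substitute the disk under test.
        print("sda command timeout:", scsi_timeout_seconds("sda"), "s")
    except OSError as exc:
        print("could not read timeout:", exc)
```

If the value is well above the 2 seconds the spec allows for pausing a
background test, a queued command should complete before error recovery
kicks in, assuming the drive behaves as the spec describes.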
Thanks for the input, this is very useful.
Chris