Re: getting I/O errors in super_written()...any ideas what would cause this?

Chris Friesen <chris.friesen@xxxxxxxxxxx> · Thu, 06 Dec 2012 12:15:32 -0600

On 12/05/2012 03:20 AM, James Bottomley wrote:
> On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
>> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
>> command specifying the default self-test in addition to the background
>> short self-test.  This seems a bit risky and excessive to me, but
>> apparently the guy that wrote it is no longer with the company.
>
> This is a really bad idea.  A lot of disks go out to lunch until the
> diagnostics complete (the same goes for SMART diagnostics).  This means
> that if you do diagnostics on a running device, the drivers start to get
> timeouts on commands which are queued waiting for diagnostics to
> complete ... if those go over the standard SCSI timeouts, we'll start to
> try error recovery and likely have the disaster you see above.

So it turns out that our problems are intermittently triggered when
running the default self-test.  This agrees with the statement in
sg_senddiag not to run foreground self-tests on disks with mounted
filesystems.
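
For reference, here is a minimal sketch of how the two tests differ at
the CDB level.  It assumes the 6-byte SEND DIAGNOSTIC layout as I read
it in SPC (opcode 1Dh, with byte 1 carrying the SELF-TEST CODE field in
bits 7-5 and the SELFTEST bit in bit 2).  The function name and its
arguments are made up for illustration; it only builds the bytes and
never talks to a device.

```python
SEND_DIAGNOSTIC_OPCODE = 0x1D  # SEND DIAGNOSTIC operation code


def build_send_diagnostic_cdb(self_test_code=0, selftest=False, pf=False,
                              devoffl=False, unitoffl=False,
                              param_list_len=0, control=0):
    """Build a 6-byte SEND DIAGNOSTIC CDB (hypothetical helper).

    Assumed byte 1 layout: SELF-TEST CODE (bits 7-5), PF (bit 4),
    reserved (bit 3), SELFTEST (bit 2), DEVOFFL (bit 1), UNITOFFL (bit 0).
    """
    if not 0 <= self_test_code <= 7:
        raise ValueError("SELF-TEST CODE is a 3-bit field")
    if not 0 <= param_list_len <= 0xFFFF:
        raise ValueError("PARAMETER LIST LENGTH is a 16-bit field")
    # Assumption: the default self-test is requested with SELFTEST=1 and
    # SELF-TEST CODE=000b, so the two must not be combined.
    if selftest and self_test_code != 0:
        raise ValueError("SELFTEST=1 requires SELF-TEST CODE 000b")

    byte1 = ((self_test_code << 5) | (int(pf) << 4) | (int(selftest) << 2)
             | (int(devoffl) << 1) | int(unitoffl))
    return bytes([
        SEND_DIAGNOSTIC_OPCODE,
        byte1,
        0x00,                          # reserved
        (param_list_len >> 8) & 0xFF,  # PARAMETER LIST LENGTH (MSB)
        param_list_len & 0xFF,         # PARAMETER LIST LENGTH (LSB)
        control,
    ])


# Default self-test: SELFTEST bit set, SELF-TEST CODE 000b.
default_self_test = build_send_diagnostic_cdb(selftest=True)

# Background short self-test: SELF-TEST CODE 001b, SELFTEST bit clear.
background_short = build_send_diagnostic_cdb(self_test_code=0b001)
```

The only difference between the two requests is byte 1, which makes it
easy to check which one a given tool is actually sending.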

We seem to be able to do background short self-tests (i.e., a SEND
DIAGNOSTIC command with a SELF-TEST CODE field of 001b and the SELFTEST
bit set to 0b) without causing any problems.  Is this pushing our luck,
or is this something that should work according to the spec and the
Linux stack?  The SCSI spec indicates that in this case, for most
commands, the test will be paused and the command executed within 2
seconds, but I don't know what the normal SCSI timeouts are.
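
One way to check the timeout on a given box is a small sketch like the
one below.  It assumes the per-device command timeout is exposed in
seconds at /sys/block/<dev>/device/timeout, which is where I would
expect the sd driver to put it (the usual default for disks is 30
seconds, as far as I know).  The function name and the sysfs_root
parameter are made up for illustration.

```python
from pathlib import Path


def scsi_command_timeout(device, sysfs_root="/sys/block"):
    """Return the SCSI command timeout in seconds for a block device.

    Hypothetical helper: assumes the value lives in
    <sysfs_root>/<device>/device/timeout as a decimal integer.
    """
    timeout_path = Path(sysfs_root) / device / "device" / "timeout"
    return int(timeout_path.read_text().strip())


def headroom_over_test_pause(timeout_s, pause_s=2):
    """Seconds left after the (up to) 2 second pause the spec describes."""
    return timeout_s - pause_s
```

For example, scsi_command_timeout("sda") reads the value for sda, and
headroom_over_test_pause() shows how much margin remains if a command
has to wait for the background test to be suspended.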

Thanks for the input, this is very useful.

Chris