On Tue, 2012-12-04 at 16:00 -0600, Chris Friesen wrote:
> As another data point, it looks like we may be doing a SEND DIAGNOSTIC
> command specifying the default self-test in addition to the background
> short self-test. This seems a bit risky and excessive to me, but
> apparently the guy that wrote it is no longer with the company.

This is a really bad idea. A lot of disks go out to lunch until the
diagnostics complete (the same goes for SMART diagnostics). This means
that if you do diagnostics on a running device, the drivers start to get
timeouts on commands which are queued waiting for diagnostics to
complete ... if those go over the standard SCSI timeouts, we'll start to
try error recovery and likely have the disaster you see above.

> What is the recommended method for monitoring disks on a system that
> is likely to go a long time between boots? Do we avoid any in-service
> testing and just monitor the SMART data and only test it if something
> actually goes wrong? Or should we intentionally drop a disk out of the
> array and test it? (The downside of that is that we lose
> redundancy since we only have 2 disks.)

What do you mean by "monitoring" ... as in what are you looking for? To
make sure the disk is healthy and responding, a simple test unit ready
works. To look at other parameters, read the mode pages.

Anything that actively causes the disk to go out and check something is
a bad idea in a running environment. Only do this if you can quiesce the
I/O before starting the active diagnostic (or drop the disk from the
array as you suggest).

To be honest, though, modern disks do a whole host of diagnostics as
they write data just to check that it is safely committed, so passive
monitoring should be fine.

James
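
For reference, the passive "test unit ready" check described above could
look roughly like the sketch below. It is only an illustration, assuming
Linux and the SG_IO ioctl from <scsi/sg.h>; the function names
(test_unit_ready_cdb, build_request, disk_is_ready) and the timeout and
sense-buffer sizes are made up for the example and are not taken from
any existing tool.

```python
import ctypes
import fcntl
import os

SG_IO = 0x2285        # ioctl request number from <scsi/sg.h>
SG_DXFER_NONE = -1    # the command transfers no data


class SgIoHdr(ctypes.Structure):
    """Mirror of struct sg_io_hdr from <scsi/sg.h>."""
    _fields_ = [
        ("interface_id", ctypes.c_int),
        ("dxfer_direction", ctypes.c_int),
        ("cmd_len", ctypes.c_ubyte),
        ("mx_sb_len", ctypes.c_ubyte),
        ("iovec_count", ctypes.c_ushort),
        ("dxfer_len", ctypes.c_uint),
        ("dxferp", ctypes.c_void_p),
        ("cmdp", ctypes.c_void_p),
        ("sbp", ctypes.c_void_p),
        ("timeout", ctypes.c_uint),
        ("flags", ctypes.c_uint),
        ("pack_id", ctypes.c_int),
        ("usr_ptr", ctypes.c_void_p),
        ("status", ctypes.c_ubyte),
        ("masked_status", ctypes.c_ubyte),
        ("msg_status", ctypes.c_ubyte),
        ("sb_len_wr", ctypes.c_ubyte),
        ("host_status", ctypes.c_ushort),
        ("driver_status", ctypes.c_ushort),
        ("resid", ctypes.c_int),
        ("duration", ctypes.c_uint),
        ("info", ctypes.c_uint),
    ]


def test_unit_ready_cdb():
    """TEST UNIT READY is a 6-byte CDB with opcode 0x00 and all other bytes zero."""
    return bytes(6)


def build_request(cdb, sense_len=32, timeout_ms=5000):
    """Fill in an sg_io_hdr for a command that transfers no data.

    The buffers are returned too so they stay alive while the ioctl runs.
    sense_len and timeout_ms are arbitrary example values.
    """
    cdb_buf = ctypes.create_string_buffer(cdb, len(cdb))
    sense_buf = ctypes.create_string_buffer(sense_len)
    hdr = SgIoHdr()
    hdr.interface_id = ord("S")
    hdr.dxfer_direction = SG_DXFER_NONE
    hdr.cmd_len = len(cdb)
    hdr.mx_sb_len = sense_len
    hdr.cmdp = ctypes.addressof(cdb_buf)
    hdr.sbp = ctypes.addressof(sense_buf)
    hdr.timeout = timeout_ms  # milliseconds
    return hdr, cdb_buf, sense_buf


def disk_is_ready(path):
    """Send TEST UNIT READY to the device at `path`; True if it reports GOOD."""
    hdr, cdb_buf, sense_buf = build_request(test_unit_ready_cdb())
    fd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        fcntl.ioctl(fd, SG_IO, hdr)  # hdr is updated in place by the kernel
    finally:
        os.close(fd)
    return hdr.status == 0 and hdr.host_status == 0 and hdr.driver_status == 0
```

Calling disk_is_ready("/dev/sda") (the path is a placeholder) needs
access to the device node, which usually means running as root. The
command only asks whether the unit is ready and does not start a
self-test.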