Re: RAID 6 Failure follow up

Vincent Schut <schut@xxxxxxxxxxxx> · Wed, 11 Nov 2009 13:46:41 +0100

Andrew Dunn wrote:
Thanks for your help, so far without smartctl installed I have had no
issues... but it has only been about 12 hours.
I also had no issues when not running smartd/smartctl. It seems to be
the combination of kernel, backplane SAS driver, and SMART that triggers
the trouble...

Could you send me your smartd.conf?

It's pretty much default, there's just one uncommented line in it:

DEVICESCAN -d scsi -a -o on -S on -s (S/../.././02|L/../../6/03) -W 4,45,55 -R 5 -m my@xxxxxxxxxxxx -M exec /usr/share/smartmontools/smartd-runner

I plan to replace the devicescan with explicit /dev/sd.. items, but as 
I'm currently regularly adding and removing (usb) drives, I kept the 
auto devicescan statement.
The rest means: enable SMART on all drives, schedule a short self-test
daily at 02:00 and a long one every Saturday at 03:00, warn when the
temperature gets too high or changes by 4 degrees or more, and mail
warnings/errors to me.
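
For anyone unsure how the -s pattern above selects test times, here is a minimal Python sketch. It assumes the smartd convention of matching the pattern against a string of the form T/MM/DD/d/HH (test type, month, day of month, day of week with 1 = Monday, hour). Python's re module stands in for the POSIX extended regex that smartd uses, and due_tests is a hypothetical helper name, not part of smartmontools.

```python
import re
from datetime import datetime

# The -s pattern from the DEVICESCAN line in this message.
SCHEDULE = r"(S/../.././02|L/../../6/03)"


def due_tests(when: datetime, pattern: str = SCHEDULE) -> list[str]:
    """Return the test types (S = short, L = long) the pattern selects for this hour."""
    due = []
    for test_type in ("L", "S"):
        # Build the T/MM/DD/d/HH string for this test type and hour.
        # isoweekday() gives 1 for Monday through 7 for Sunday.
        stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(pattern, stamp):
            due.append(test_type)
    return due
```

With this pattern, due_tests(datetime(2009, 11, 11, 2)) (a Wednesday, 02:00) gives ['S'], and due_tests(datetime(2009, 11, 14, 3)) (a Saturday, 03:00) gives ['L'].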

VS.

Vincent Schut wrote:
Andrew Dunn wrote:
I am able to reproduce this smart error now. I have done it twice, so
maybe other things are causing this also.

When I scanned the devices this morning with smartctl via webmin I lost
8 of the 9 drives. They are, however, still in my /dev folder.

I sent out my logs from the first failure last night; smartctl was
on the system... I don't know if Ubuntu Server's default smartd
configuration makes it do periodic scans, because I didn't change
anything.

I would hate to move back to 9.10 and see this problem again.

Should I just not install smartmontools? This seems like a bad solution
because then I won't be able to check the drives in advance for failures.

Have you installed LSI's Linux drivers? Some people say this solves
their issue.

From the logs sent out last night do you think it could be something
else?

Thanks a ton,
FWIW, I encountered the same issue, and seem to have found a viable
workaround by accessing the SATA disks on that LSI backplane as scsi
devices, e.g. by adding '-d scsi' to my smartctl/smartd.conf lines. No
more errors in the logs, no more drives being kicked out.
Less info is available that way than with the SATA driver ('-d sat',
or autodetected); temperature, for example, is unavailable. But it does
allow me to initiate the selftests and get their results, and to
monitor the generic SMART status of the drives. Quite enough for me.

YMMV, though.
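
A rough sketch of the smartctl calls this workaround comes down to, written as a small Python wrapper. The options used (-d scsi, -H, -t short, -l selftest) are standard smartctl options; the function names and the /dev/sdX placeholder are only illustrative, so substitute your own device nodes.

```python
import subprocess


def smartctl_scsi_cmd(device: str, *options: str) -> list[str]:
    """Build a smartctl command line that forces the SCSI device type (-d scsi)."""
    return ["smartctl", "-d", "scsi", *options, device]


def health_cmd(device: str) -> list[str]:
    # -H: report the overall health status of the drive.
    return smartctl_scsi_cmd(device, "-H")


def short_test_cmd(device: str) -> list[str]:
    # -t short: start a short self-test in the background.
    return smartctl_scsi_cmd(device, "-t", "short")


def selftest_log_cmd(device: str) -> list[str]:
    # -l selftest: print the self-test log, including past results.
    return smartctl_scsi_cmd(device, "-l", "selftest")


def run(cmd: list[str]) -> str:
    """Run a command (needs smartmontools installed, usually as root) and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
```

For example, run(health_cmd("/dev/sdX")) is the same as typing "smartctl -d scsi -H /dev/sdX" at a shell.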

Vincent.
Gabor Gombas wrote:
On Mon, Nov 09, 2009 at 05:08:23AM -0500, Andrew Dunn wrote:

Does it momentarily offline the disks? Like they re-appear in /dev
within moments? That would be similar behavior to what I am
experiencing: the disks drop from the array, but they are in /dev
by the time I get a chance to see them.

No, either the disks need to be physically removed and re-inserted, or
the machine needs to be rebooted.

Gabor
