Re: SATA errors?

Justin Piszcz <jpiszcz@xxxxxxxxxxxxxxx> · Wed, 1 Oct 2008 18:03:50 -0400 (EDT)

On Wed, 1 Oct 2008, Danilo Godec wrote:

I don't want to start any holy wars, but I'm not using a RAID controller. 
It's just a plain old on-board SATA controller (at least that's what I think 
it is):

00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage 
Controller AHCI (rev 09)

David Lethe wrote:
If this was my system, I would ...
1) First check into upgrading firmware/bios/drivers of disk controller.
2) Look at cron jobs and see if anything that needs capacity runs around
the time the errors are reported.  Something has to run to start this
off, so you need to find it.
3) Use logger and a shell script to try to catch system in the 15 second
window when you have this problem, and see what programs are running.
4) Actually, if this was my system, and if I/O wasn't actually being
suspended during those 15 seconds, then I probably would do step 1 only,
and if everything is current, then I would move on and not worry about
it.   Even if you find the offending program, then that doesn't mean
that the author of the program has or will make an acceptable change in
their code. 
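
Step 3 above could be sketched roughly as follows. This is only an
illustration, not something from the thread: the helper names (parse_ps,
log_snapshot, watch), the "sata-watch" syslog tag and the 5-second interval
are all assumptions, chosen so that a few snapshots land inside a 15-second
window. It calls ps and writes each process to syslog so the entries can be
lined up against the kernel's SATA messages by timestamp.

```python
import subprocess
import syslog
import time

# PID, process state and command name for every process on the system.
PS_CMD = ["ps", "-eo", "pid,stat,comm"]


def parse_ps(output):
    """Turn `ps -eo pid,stat,comm` output into (pid, stat, comm) tuples."""
    rows = []
    for line in output.splitlines()[1:]:   # skip the header row
        parts = line.split(None, 2)        # comm may itself contain spaces
        if len(parts) == 3:
            rows.append((int(parts[0]), parts[1], parts[2].strip()))
    return rows


def log_snapshot(tag="sata-watch"):
    """Write one process snapshot to syslog (tag is an assumed name)."""
    out = subprocess.run(PS_CMD, capture_output=True, text=True,
                         check=True).stdout
    syslog.openlog(tag)
    for pid, stat, comm in parse_ps(out):
        syslog.syslog(syslog.LOG_INFO, f"pid={pid} stat={stat} comm={comm}")


def watch(interval=5, iterations=None):
    """Snapshot every `interval` seconds; loop forever if iterations is None."""
    done = 0
    while iterations is None or done < iterations:
        log_snapshot()
        done += 1
        time.sleep(interval)
```

Calling watch() and leaving it running until the next error shows up should
be enough. Processes whose stat column starts with D are in uninterruptible
sleep, which usually means they are waiting on I/O.
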
1. I will get a new server in a couple of days and then I'll be able to move 
the Xen VMs from the 'problematic' server. Then I'll see what can be 
updated/upgraded.
2. The errors are pretty much random and there is nothing in cron at all. 
I don't think the Xen VMs could do anything with the physical drive, so their 
crons shouldn't be relevant.
3. It's not really a problem that we (the users) would feel. It's just the 
logs that got me worried (I don't like unexplainable hard drive errors).
4. As said before, I changed the scripts to use 'smartctl' with one of the 
other drives. So far it seems better - there hasn't been a single error in 12 
hours.
The problem you have from SCSI perspective is the bozo who wrote this
chunk of code did it the wrong way. The CORRECT way to determine
addressable blocks is to send out the READCAP10, look for return value
of FFFFFFFFh blocks, then issue a 16-byte (0x9e 0x10) READ CAPACITY,
because you have > FFFFFFFE blocks on the disk.  This architect never
imagined that the READCAP10 would have to deal with large disks, and
assumed if there was a problem, then the disk needs to be reset. 
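
The READ CAPACITY sequence described above can be sketched like this. It is
a minimal illustration under stated assumptions, not the code of smartctl or
of any kernel driver: send_cdb is a hypothetical callable that sends a CDB to
the device and returns the response bytes (on Linux it might be built on the
SG_IO ioctl), and read_capacity is a name made up for this example. The
response layouts (4-byte last LBA plus 4-byte block length for the 10-byte
command, 8-byte last LBA plus 4-byte block length for the 16-byte one) follow
the SCSI block command set.

```python
import struct

READ_CAPACITY_10 = 0x25      # 10-byte READ CAPACITY opcode
SERVICE_ACTION_IN_16 = 0x9E  # opcode that carries READ CAPACITY(16)
SA_READ_CAPACITY_16 = 0x10   # service action selecting READ CAPACITY(16)


def read_capacity(send_cdb):
    """Return (block_count, block_size) for a device.

    send_cdb(cdb, alloc_len) -> bytes is a hypothetical transport hook.
    """
    # Step 1: always try the 10-byte command first.
    cdb10 = bytes([READ_CAPACITY_10]) + bytes(9)
    data = send_cdb(cdb10, 8)
    last_lba, block_size = struct.unpack(">II", data[:8])
    if last_lba != 0xFFFFFFFF:
        return last_lba + 1, block_size

    # Step 2: FFFFFFFFh means "too big for 32 bits", so ask again with the
    # 16-byte command instead of treating it as an error.
    # Layout: opcode, service action, 8-byte LBA (0), 4-byte allocation
    # length, one reserved/PMI byte, control byte.
    cdb16 = struct.pack(">BBQIBB", SERVICE_ACTION_IN_16,
                        SA_READ_CAPACITY_16, 0, 32, 0, 0)
    data = send_cdb(cdb16, 32)
    last_lba, block_size = struct.unpack(">QI", data[:12])
    return last_lba + 1, block_size
```

The point of the sketch is the branch in the middle: a returned last LBA of
FFFFFFFFh is a signal to retry with the larger command, not a reason to reset
the disk.
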
If it turns out that 'smartctl' was causing this I'll report it to 
'smartmontools' guys.

Thanks for the help, Danilo

--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

I actually turned off SMART (the daemon, etc.) - the problems still persist 
for me.

Justin.