Re: data corruption: ext3/lvm2/md/mptsas/vitesse/seagate

Marc Bejarano <beej@xxxxxxxxxxxx> · Tue, 11 Mar 2008 18:14:34 -0400

At 11:36 3/10/2008, James Bottomley wrote:
>On Fri, 2008-03-07 at 17:40 -0500, Marc Bejarano wrote:
>> This is hard to explain.  It looks like page 309713975 got written
>> out to the proper spot, but then the first 10752 bytes got written
>> out again to the wrong spot?!?
>
>I'm afraid you're not going to like this, but this pattern of corruption
>is almost completely definitive of a disk problem with head positioning.

are you kidding?  i LOVE this :)  just to have a working theory is a 
huge relief.

>It's still theoretically possible that something went wrong in the
>actual HBA, but I'd place most of my money on a disk fault.

at this point, i'd do likewise.

>Just as a matter of interest, what version of firmware do you have?

one of our early suspects was drive firmware.  we'd already been 
bitten once by a 7200.10 firmware "upgrade" messing us up.  this box 
was originally using a mix of 3.AAJ's and 3.AAK's, but since these 
were our first K's, we took them out of the picture.  since we have 
lots of J's in active use and had never seen any problems, i assumed 
they were fine and looked elsewhere.  going back over some other 
production machines, it looks like all the important stuff is on 
pre-J's.  we don't seem to have J's in high-stress environments.

>I'm afraid the only way to confirm this theory definitively will be with
>the destructive disktest from autotest (it was actually constructed to
>check for drive head positioning errors)

thanks to you (and grant) for the pointer!  will try that next.
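
for readers following along: the general idea behind a destructive test of this kind can be sketched as below. this is NOT the autotest disktest code, only a minimal illustration under assumptions. every block is filled with records naming the block it belongs to, so data that lands in the wrong spot identifies where it was meant to go. the 4096-byte block size, the record layout, and all function names here are hypothetical, and the demo runs against an in-memory buffer rather than a real device.

```python
import io
import struct

BLOCK_SIZE = 4096                 # assumed block size, not taken from this thread
RECORD = struct.Struct("<QQ")     # hypothetical record: (block index, pass number)


def make_block(index, pass_no, block_size=BLOCK_SIZE):
    """Build one block made of repeated records that name the block itself."""
    return RECORD.pack(index, pass_no) * (block_size // RECORD.size)


def write_pattern(dev, n_blocks, pass_no, block_size=BLOCK_SIZE):
    """DESTRUCTIVE: overwrite the first n_blocks of dev with self-identifying blocks."""
    for i in range(n_blocks):
        dev.seek(i * block_size)
        dev.write(make_block(i, pass_no, block_size))


def verify_pattern(dev, n_blocks, pass_no, block_size=BLOCK_SIZE):
    """Return (byte offset, expected index, found index, found pass) per bad record."""
    errors = []
    for i in range(n_blocks):
        dev.seek(i * block_size)
        data = dev.read(block_size)
        for off in range(0, len(data) - RECORD.size + 1, RECORD.size):
            found_index, found_pass = RECORD.unpack_from(data, off)
            if (found_index, found_pass) != (i, pass_no):
                errors.append((i * block_size + off, i, found_index, found_pass))
    return errors


if __name__ == "__main__":
    # In-memory stand-in for a disk; a real run would open a block device instead.
    dev = io.BytesIO(bytes(8 * BLOCK_SIZE))
    write_pattern(dev, 8, pass_no=1)

    # Simulate a misplaced write: 10752 bytes meant for block 2 onward
    # are written a second time starting at block 5.
    dev.seek(2 * BLOCK_SIZE)
    stray = dev.read(10752)
    dev.seek(5 * BLOCK_SIZE)
    dev.write(stray)

    errors = verify_pattern(dev, 8, pass_no=1)
    print(len(errors) * RECORD.size, "bytes in the wrong place")
    print("first bad record:", errors[0])
```

because each bad record carries the index of the block it was written for, the verify pass reports not just that data is wrong but where it should have gone, which is what separates a misplaced write from random bit corruption.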

cheers,
marc
