At 11:36 3/10/2008, James Bottomley wrote:
>On Fri, 2008-03-07 at 17:40 -0500, Marc Bejarano wrote:
>> This is hard to explain. It looks like page 309713975 got written
>> out to the proper spot, but then the first 10752 bytes got written
>> out again to the wrong spot?!?
>
>I'm afraid you're not going to like this, but this pattern of corruption
>is almost completely definitive of a disk problem with head positioning.
are you kidding? i LOVE this :) just to have a working theory is a
huge relief.
>It's still theoretically possible that something went wrong in the
>actual HBA, but I'd place most of my money on a disk fault.
at this point, i'd do likewise.
>Just as a matter of interest, what version of firmware do you have?
one of our early suspects was drive firmware. we'd already been
bitten once by a 7200.10 firmware "upgrade" messing us up. this box
was originally using a mix of 3.AAJ's and 3.AAK's, but since these
were our first K's, we took them out of the picture. since we have
lots of J's in active use and had never seen any problems, i assumed
they were fine and looked elsewhere. going back over some other
production machines, it looks like all the important stuff is on
pre-J's. we don't seem to have J's in high-stress environments.
>I'm afraid the only way to confirm this theory definitively will be with
>the destructive disktest from autotest (it was actually constructed to
>check for drive head positioning errors)
thanks to you (and grant) for the pointer! will try that next.
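
for the archives, here is my rough understanding of how a destructive
test can catch a misplaced write. this is only a sketch in python, not
autotest's actual disktest code: the function names, the 4096-byte block
size and the "DTST" marker are all made up for illustration. every block
is filled with records carrying its own block number, so data that lands
in the wrong spot says where it was meant to go.

```python
import os
import struct

BLOCK_SIZE = 4096   # assumed block size for this sketch
MAGIC = b"DTST"     # hypothetical marker so stamped records are recognisable
REC_LEN = len(MAGIC) + 8


def make_block(index, block_size=BLOCK_SIZE):
    """Build a block made of repeated records that each carry `index`."""
    record = MAGIC + struct.pack("<Q", index)
    reps = block_size // REC_LEN
    return (record * reps).ljust(block_size, b"\0")


def write_pattern(path, n_blocks, block_size=BLOCK_SIZE):
    """DESTRUCTIVE: overwrite the first n_blocks of `path` with stamped blocks."""
    with open(path, "r+b") as f:
        for i in range(n_blocks):
            f.seek(i * block_size)
            f.write(make_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())


def verify_pattern(path, n_blocks, block_size=BLOCK_SIZE):
    """Return (block, offset, stamp_found) for the first bad record per block.

    stamp_found is the block number the bad record claims to belong to,
    or None if the record is not a recognisable stamp.
    """
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            f.seek(i * block_size)
            data = f.read(block_size)
            expected = make_block(i, block_size)
            if data == expected:
                continue
            for off in range(0, block_size - REC_LEN + 1, REC_LEN):
                rec = data[off:off + REC_LEN]
                if rec != expected[off:off + REC_LEN]:
                    found = None
                    if len(rec) == REC_LEN and rec[:len(MAGIC)] == MAGIC:
                        found = struct.unpack("<Q", rec[len(MAGIC):])[0]
                    bad.append((i, off, found))
                    break
            else:
                # only the trailing padding (or the block length) differs
                bad.append((i, (block_size // REC_LEN) * REC_LEN, None))
    return bad
```

a result like (2, 0, 5) would mean block 2 starts with data stamped for
block 5, which is the shape of corruption described above. pointing this
at a real device wipes it, and the read-back would have to bypass the
page cache (O_DIRECT, or dropping caches first) before it says anything
about the disk itself.
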
cheers,
marc