RE: RAID halting

> -----Original Message-----
> From: linux-raid-owner@xxxxxxxxxxxxxxx
> [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of David Lethe
> Sent: Sunday, April 05, 2009 9:23 AM
> To: lrhorer@xxxxxxxxxxx; linux-raid@xxxxxxxxxxxxxxx
> Subject: FW: RAID halting
> 
> -----Original Message-----
> From: linux-raid-owner@xxxxxxxxxxxxxxx
> [mailto:linux-raid-owner@xxxxxxxxxxxxxxx] On Behalf Of Leslie Rhorer
> Sent: Sunday, April 05, 2009 3:14 AM
> To: linux-raid@xxxxxxxxxxxxxxx
> Subject: RE: RAID halting
> 
> > All of what you report is still consistent with delays caused by
> > having to remap bad blocks
> 
> I disagree.  If it happened with some frequency during ordinary reads,
> then I would agree.  If it happened without respect to the volume of
> reads and writes on the system, then I would be less inclined to
> disagree.
> 
> > The O/S will not report recovered errors, as this gets done
> > internally by the disk drive, and the O/S never learns about it.
> > (Queue depth
> 
> SMART is supposed to report this, and on rare occasions the kernel log
> does report a block of sectors being marked bad by the controller.  I
> cannot speak to the notion that SMART's reporting of relocated sectors
> and failed relocations may not be accurate, as I have no means to
> verify it.
> 
> Actually, I should amend the first sentence, because while the ten
> drives in the array almost never report any errors, there is another
> drive in the chassis which is chunking out error reports like a farm
> boy spitting out watermelon seeds.  I had a 320G drive in another
> system which was behaving erratically, so I moved it to the array
> chassis on this machine to rule out a cable or the drive controller.
> It's reporting blocks being marked bad all over the place.
> 
> > Really, if this was my system I would run non-destructive read tests
> > on all blocks;
> 
> How does one do this?  Or rather, isn't this what the monthly mdadm
> resync does?
> 
> > along with the embedded self-test on the disk.  It is often
> 
> How does one do this?
> 
> > a lot easier and more productive to eliminate what ISN'T the problem
> > rather than chase all of the potential reasons for the problem.
> 
> I agree, which is why I am asking for troubleshooting methods and
> utilities.
> 
> The monthly RAID array resync started a few minutes ago, and it is
> providing some interesting results.  The number of blocks read per
> second is consistently 13,000 - 24,000 on all ten drives.  There were
> no other drive accesses of any sort at the time, so the number of
> blocks written was flat zero on all drives in the array.  I copied the
> /etc/hosts file to the RAID array, and instantly the file system
> locked, but the array resync *DID NOT*.  The number of blocks read and
> written per second continued to range from 13,000 to 24,000
> blocks/second, with no apparent halt or slow-down at all, not even for
> one second.  So if it's a drive error, why are file system reads
> halted almost completely, and writes halted altogether, yet drive
> reads at the RAID array level continue unabated at an aggregate of
> more than 130,000 - 240,000 blocks (500 - 940 megabits) per second?  I
> tried a second copy and again the file system accesses to the drives
> halted altogether.  The block reads (which had been alternating with
> writes after the new transfer processes were implemented) again jumped
> to between 13,000 and 24,000.  This time I used a stopwatch, and the
> halt was 18 minutes 21 seconds - I believe the longest ever.  There is
> absolutely no way it would take a drive almost 20 minutes to mark a
> block bad.  The dirty blocks grew to more than 78 megabytes.  I just
> did a 3rd cp of the /etc/hosts file to the array, and once again it
> locked the machine for what is likely to be another 15 - 20 minutes.
> I tried forcing a sync, but it also hung.
> 
> <Sigh>  The next three days are going to be Hell, again.  It's going
> to be all but impossible to edit a file until the RAID resync
> completes.  It's often really bad under ordinary loads, but when the
> resync is underway, it's beyond absurd.
> 
> ======
> Leslie:
> Respectfully, your statement, "SMART is supposed to report this" shows
> you have no understanding of exactly what S.M.A.R.T. is and is not
> supposed to report, nor do you know enough about hardware to make an
> educated decision about what can and can not be contributing factors.
> As such, you are not qualified to dismiss the necessity to run
> hardware diagnostics.
> 
> A few other things - many SATA controller cards use poorly architected
> bridge chips that spoof some of the ATA commands, so even if you
> *think* you are kicking off one of the SMART subcommands, like the
> SMART_IMMEDIATE_OFFLINE (op code d4h with the extended self test,
> subcommand 2h), then it is possible, perhaps probable, they are never
> getting run. -- yes, I am giving you the raw opcodes so you can look
> them up and learn what they do.
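> 
> For what it's worth -- and this is just a sketch, with the device name
> as a placeholder -- if smartmontools is installed, the extended
> self-test is kicked off and checked like so:
> 
>   # ask the drive itself to run its extended (long) self-test
>   smartctl -t long /dev/sdb
> 
>   # a few hours later, look at the self-test log; if no entry ever
>   # appears, the command likely never made it past the controller
>   smartctl -l selftest /dev/sdb
> 
> If the controller or its bridge chip swallows the command, that log is
> exactly where you will (fail to) see it.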
> 
> You want to know how it is possible that frequency or size of reads
> can be a factor?
> Do the math:
>  * Look at the # of ECC bits you have on the disks (read the specs),
> and compare that with the trillions of bytes you have.  How frequently
> can you expect to have an unrecoverable ECC error based on that?  (A
> rough worked example follows this list.)
>  * What percentage of your farm are you actually testing with the
> tests you have run so far? Is it even close to being statistically
> significant?
>  * Do you know what physical blocks on each disk are being
> read/written with the tests you mention? If you do not know, then how
> do you know that the short tests are doing I/O on blocks that need to
> be repaired, and subsequent tests run OK because those blocks were
> just repaired?
>  * Did you look into firmware? Are the drives and/or firmware
> revisions qualified by your controller vendor?
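> 
> To put a rough number on the first point: a typical consumer SATA
> drive of this class is specified at something like one unrecoverable
> read error per 10^14 bits read (check your own drives' data sheets --
> the exact figure varies by model).  If the ten drives were, say,
> 1.5 TB each, one full pass over the array reads about 15 TB, which is
> 1.2 x 10^14 bits.  At that spec you should expect on the order of one
> unrecoverable error per complete pass, so "the drives almost never
> report errors" is not something you get to assume without actually
> testing them.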
> 
> I've been in the storage business for over 10 years, writing
> everything from RAID firmware and configurators to disk diagnostics
> and test bench suites.  I even have my own company that writes storage
> diagnostics.  I think I know a little more about diagnostics and what
> can and can not happen.  You said that you do not agree with my
> earlier statements.  I doubt that you will find any experienced
> storage professional who wouldn't tell you to break it all down and
> run a full block-level DVT before going further.  It could all have
> been done over the week-end if you had the right setup, and then you
> would know a lot more than you know now.
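> 
> As a concrete starting point -- purely a sketch, substitute your own
> device names -- a non-destructive read-only pass over every block of
> one member disk looks like this:
> 
>   # read-only surface scan; reports any block the drive cannot return
>   badblocks -sv /dev/sdb
> 
> Repeat for each member drive.  Unlike the md resync, this touches
> every sector of the raw disk, not just the sectors belonging to the
> array, and it tells you which drive failed and on which blocks.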
> 
> At this point all you have done is tell people who suggest hardware is
> the cause that they are wrong and then tell us why you think we are
> wrong.  Frankly, be lazy and don't run diagnostics; you had just
> better not be a government employee, or in charge of a database that
> contains financial, medical, or other such information, and you had
> better be running hot backups.
> 
> If you still refuse to run a full block-level hardware test, then ask
> yourself how much longer you will allow this to go on before you run
> such a test, or whether you are just going to continue down this path
> waiting for somebody to give you a magic command to type in that will
> fix everything.
> 
> I am not the one who, at best, is putting my job on the line and, at
> worst, is looking at a criminal violation for not taking appropriate
> actions to protect certain data.  I make no apology for beating you up
> on this.  You need to hear it.

P.S. Your other mistake was assuming that your configuration would ever
"work" in the first place, even if every disk passed full diagnostics.
How do you know there aren't any firmware bugs biting you?  Everything
may work fine individually, while the combination of the 2.1TB
overflow, TCQ, old drive firmware, configurable drive settings, etc.
interacts to create chaos above and beyond block read errors.

Take the FS out of the equation: boot from a CD-ROM and do some reads
of the raw md devices rather than the mounted filesystem.
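
Something as simple as this would do -- the device name is only an
example, so point it at your actual md device:

  # sequential read of the raw array, bypassing the filesystem entirely
  dd if=/dev/md0 of=/dev/null bs=1M

If that can run for hours without a stall, the hang is above the block
layer; if it stalls the same way your file copies do, the filesystem is
off the hook.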

Maybe the reason your config doesn't work now is that it would never
have worked in the first place.

-David


