Re: new problem has developed

maarten van den Berg <maarten@vbvb.nl> · Thu, 30 Oct 2003 14:14:45 +0100

On Wednesday 29 October 2003 06:52, Mark Hahn wrote:
> > For various reasons I decided to decommission the old hardware (AMD K6)
> > and I built a newer (and 100% known-good) board in it earlier today. That
> > makes a BIG difference in initial speed, I now get 14000K/sec instead of
> > the dead slow AMD K6 did. However, at 5.2% the speed drops significantly.
> > We're now back at 5.3% and speed has dropped from 13000K to 170K and
> > continues to drop.
>
> this sort of thing *can* actually occur because of sick disks.

Thanks for replying.  Yes, it was a bad disk and I solved it eventually.

> > I investigated already on the old machine with several tools, of course
> > mdadm, but also iostat and keeping an eye on /var/log/messages.  All
> > seems proper.
>
> smartctl on the disks?

If only my BIOS would support that... :-( 
I don't know if it's the main BIOS or the promise cards that must support it, 
but 'ide-smart' just gives no output at all.

I did a 'badblocks' on one disk that was part of the array but already got 
kicked twice from it.  Lo and behold, starting at about 4GB it developed a 
problem (slow reads due to endless retries).  As I desperately NEEDED this 
drive (my array was already degraded!) I decided to use 'dd_rescue' to clone 
it to a good disk and re-assemble the array from there. The dd_rescue 
operation took more than 30 hours(!) and showed that there was a problem 
around the 4GB and also around 71 GB markers. Several MB could not be 
recovered (which is close to nothing, percentage-wise).
Mdadm then reassembled the array with the fresh drive, and subsequent 
hot-adding went as fast as it should. One day later I added a new hot-spare. 
All is well now. I will surely find corrupted data at some point due to the 
missing MB's.  But I see no way to avoid this anyhow...
I just hope it is a file, not reiserfs meta-data, that got killed.

Taking into account that dd_rescue took 30 hours it stands to reason that 
maybe the resync would have worked after all, if only I would have let it run 
longer.  The problem is partly that the resync just seems to grind to a halt, 
whereas dd_rescue is much more verbose in what it does.  If I could 
distinguish between a 'crash' and a slow process (that still works -albeit 
slow) this probably wouldn't have happened. Well, now we know...

> > I'm unsure if this could be due to a disk hardware fault but then it
> > would surely show up in syslog, right ?
>
> no.  there's no syslog-over-ata/scsi afaikt ;)
>
> > Could disk corruption be the culprit ? My
>
> I'd guess vibration.  I've experienced several kinds of recent disks that
> under bad conditions (vibration, near-death) just get amazingly slow,
> but continue to work.  this is, of course, really, really good...

They vibrate, yeah.  That's just what happens if you put eight disks together 
in a cabinet and put two 120mm papst fans right in front of them...  ;-)
(But at least they stay quite cool, really quite cool...)

Maarten

-- 
Yes of course I'm sure it's the red cable. I guarante[^%!/+)F#0c|'NO CARRIER
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html