Re: Harddisk gone bad

"Theodore Ts'o" <tytso@mit.edu> · Mon, 11 Nov 2002 16:43:20 -0500

On Sun, Nov 10, 2002 at 08:55:26PM +0100, Francesco Peeters wrote:
> Hi all,
> 
> I know this is the EXT3 list, and my problem is with an EXT2
> filesys, but I cannot seem to find a more suitable list on this
> server, and I have seen a lot of knowledge go by on this list and in
> the archives, so I thought I'd give it a try anyway...
> 
> Here goes nothing:
> 
> I am in a terrible problem: My data disk on my Linux server has gone
> bad, with approx 18 GB of data on it, and I never got round to
> installing a backup system!  :-( (I know: very stupid!) I never
> noticed anything before, but I went on vacation, and after returning
> I simply turned the box on again, and now I have this problem!!!
> 
> It gave an error on a short read (attempt to read block from
> filesystem resulted in short read while trying to open
> /dev/hdc1. Could this be a zero-length partition?) and I ran e2fsck
> -cc on it, which seems to have fixed that, however the following
> inode sweep gives so many 'bad blocks in inode XXXXX', that I am
> afraid that I'll be left with an empty disk once the check is
> done...

I suppose one of these days someone really should write a "hard disk
catastrophe" HOWTO.....

When you have a lot of precious data on a disk that hasn't been backed
up, the very **first** thing you should do is finish cursing yourself
for being twenty different kinds of fool for not having a backup
system.  Get that out of your system, so you don't make any further
mistakes.....

Next, get yourself a backup hard drive which is at least as big as the
disk which is in trouble, and do a full disk-to-disk copy of the disk
that's in trouble:

	# noerror: keep going past read errors; sync: zero-pad short or failed reads
	dd if=/dev/hdc of=/dev/hdd bs=1k conv=noerror,sync

Do this right away, because if the problem was due to hardware
failure, you want to grab a snapshot before the disk gets any worse.    
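To see why both conversions matter, here is a minimal Python sketch of what dd's conv=noerror,sync does for a rescue copy. It only illustrates the idea and is not how dd is implemented; rescue_copy and the FlakyDisk test class are hypothetical names. Without noerror the copy would stop at the first bad block, and without sync a failed block would simply be skipped, shifting every later block to the wrong offset and wrecking the filesystem layout on the copy.

```python
import io

BLOCK_SIZE = 1024  # same as bs=1k in the dd command


def rescue_copy(src, dst, block_size=BLOCK_SIZE):
    """Copy src to dst one block at a time, in the spirit of dd conv=noerror,sync.

    A block whose read raises OSError is written as zeros (noerror), and a
    short read is zero-padded to a full block (sync), so every later block
    lands at the same offset it had on the source.  Returns the numbers of
    the blocks that could not be read.
    """
    bad_blocks = []
    block_no = 0
    while True:
        # Seek explicitly: after a failed read the file position is unreliable.
        src.seek(block_no * block_size)
        try:
            data = src.read(block_size)
        except OSError:
            bad_blocks.append(block_no)
            data = b"\0" * block_size
        else:
            if not data:
                break  # end of the source device or image
            data = data.ljust(block_size, b"\0")
        dst.write(data)
        block_no += 1
    return bad_blocks


class FlakyDisk(io.BytesIO):
    """Hypothetical in-memory stand-in for a failing disk, for trying the sketch."""

    def __init__(self, data, bad_blocks, block_size=BLOCK_SIZE):
        super().__init__(data)
        self.bad_blocks = set(bad_blocks)
        self.block_size = block_size

    def read(self, size=-1):
        if self.tell() // self.block_size in self.bad_blocks:
            raise OSError(5, "Input/output error")
        return super().read(size)


if __name__ == "__main__":
    original = bytes(range(256)) * 16  # 4 blocks of 1k
    out = io.BytesIO()
    print(rescue_copy(FlakyDisk(original, bad_blocks={1}), out))  # prints [1]
```

Against a real device you would open the source unbuffered, e.g. open("/dev/hdc", "rb", buffering=0), so each read maps to a single read(2) call; but for an actual rescue, use dd itself as shown above.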

For experimental purposes, if you're not sure what you're doing, it's
useful to get another spare disk, and make a second-generation copy
from your first primary backup.  That way, you can experiment on the
second-generation copy, and if one recovery technique doesn't work,
you can try again with a different technique, and not have to worry
about making any irrecoverable mistakes.

The first thing I would try at this point is an "e2fsck -y" on the
second-generation backup.  See what you can save when it's all done;
don't forget to check the lost+found directory in the root of the
filesystem.  Sometimes files will end up there.
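Files that e2fsck reconnects to lost+found have no surviving directory entry, so e2fsck names them "#" followed by their inode number. A minimal Python sketch for surveying them, assuming the repaired copy is mounted (read-only is safest) at a path you pass on the command line, for example a hypothetical /mnt/rescue; list_recovered is a made-up helper, not an e2fsprogs tool:

```python
import os
import re

# e2fsck names each file it reconnects to lost+found "#<inode number>",
# because no directory entry (and so no original name) survived for it.
ORPHAN_NAME = re.compile(r"#(\d+)")


def list_recovered(mount_point):
    """Return sorted (inode, is_dir, size, name) for e2fsck-named entries in lost+found."""
    lost = os.path.join(mount_point, "lost+found")
    found = []
    with os.scandir(lost) as entries:
        for entry in entries:
            match = ORPHAN_NAME.fullmatch(entry.name)
            if match is None:
                continue  # not a name e2fsck made up
            st = entry.stat(follow_symlinks=False)
            found.append((int(match.group(1)),
                          entry.is_dir(follow_symlinks=False),
                          st.st_size, entry.name))
    return sorted(found)


if __name__ == "__main__":
    import sys

    for inode, is_dir, size, name in list_recovered(sys.argv[1]):
        kind = "dir " if is_dir else "file"
        print(f"{kind} {name:>12} {size:>12} bytes")
```

Recovered directories usually still contain their files under the original names, so they are the first entries worth looking inside.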

If the e2fsck run doesn't get your data back, the next steps will
require a lot more expertise and special work.  So I'd start with
e2fsck, and see how much you can recover that way.

> Now when I try to do e2fsck /dev/hdc1 I get 'a corruption was found
> in the superblock' When I try e2fsck -b 8193 /dev/hdc1 It claims it
> is not a valid superblock... The same for for instance 32679, a.s.o.

For a filesystem with 4k blocks, the first backup superblock is at
block 32768 (8193 is only correct for 1k blocks).  But please, make
the full disk-to-disk backups first, and experiment on the backups.
That way, you don't need to worry about panic-induced mistakes making
the problem any worse.
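Backup superblocks sit at the start of block groups, which is why the right number depends on the block size. A minimal Python sketch of where they land, assuming mke2fs defaults (8 × block_size blocks per group, first data block 1 for 1k blocks and 0 otherwise) and either with or without the sparse_super feature; backup_superblocks is a hypothetical helper. If you know the options the filesystem was created with, "mke2fs -n" with those same options prints the backup locations without writing anything, but run it only against the copy and double-check the -n. On the copy, something like "e2fsck -b 32768 -B 4096" then points e2fsck at the first 4k backup.

```python
def backup_superblocks(total_blocks, block_size=4096, sparse_super=True):
    """Block numbers of the backup superblocks on an ext2 filesystem.

    Assumes the mke2fs defaults: blocks_per_group = 8 * block_size (one
    bitmap block's worth of bits) and first_data_block = 1 for 1k blocks,
    0 otherwise.  Numbers are in units of block_size, as e2fsck -b expects.
    """
    blocks_per_group = 8 * block_size
    first_data_block = 1 if block_size == 1024 else 0
    groups = -(-(total_blocks - first_data_block) // blocks_per_group)  # ceiling

    def has_backup(group):
        # With sparse_super, only groups 0, 1 and powers of 3, 5 and 7 keep a copy.
        if not sparse_super or group <= 1:
            return True
        for base in (3, 5, 7):
            n = group
            while n % base == 0:
                n //= base
            if n == 1:
                return True
        return False

    # Group 0 holds the primary superblock, so start at group 1.
    return [first_data_block + g * blocks_per_group
            for g in range(1, groups) if has_backup(g)]


if __name__ == "__main__":
    # The first entry is the one to try first with e2fsck -b.
    print(backup_superblocks(5_000_000, block_size=4096)[:3])   # [32768, 98304, 163840]
    print(backup_superblocks(20_000_000, block_size=1024)[:3])  # [8193, 24577, 40961]
```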

						- Ted

P.S.  For those people for whom backups are just too much effort,
*please* consider using the "e2image" program to snapshot and back up
critical filesystem metadata.  It's not a replacement for doing full
data backups, but at least if you have an e2image dump, in the worst
case you'll be able to recover more files if a disk failure damages
your inode table.  The problem is that without the inode table there
is no record of which blocks go with which files, which means that
recovering files becomes a very, very painful manual process.  e2image
will create a backup copy of the inode table which, even if it is not
fully up-to-date, will help when trying to reconstruct data from a
filesystem after a disk failure.  Of course, the real answer is to do
real backups.....
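A metadata snapshot is cheap enough to take regularly, e.g. from cron. A minimal Python sketch that builds and runs "e2image DEVICE IMAGE-FILE" for a list of devices; the helper names, the /backup/metadata directory and the <device>-<date>.e2i naming scheme are all hypothetical, and the image directory should be on a different disk from the filesystems being imaged.

```python
import datetime
import os
import subprocess


def e2image_command(device, backup_dir, when=None):
    """Build the argument list for "e2image DEVICE IMAGE-FILE".

    The image-file naming scheme (<device>-<date>.e2i) is just an example.
    """
    when = when or datetime.date.today()
    name = f"{os.path.basename(device)}-{when.isoformat()}.e2i"
    return ["e2image", device, os.path.join(backup_dir, name)]


def snapshot_metadata(devices, backup_dir, dry_run=True):
    """Print (dry_run) or run e2image for each device.

    backup_dir should live on a different disk from the devices being imaged,
    or a single disk failure takes out both the data and its metadata copy.
    """
    for device in devices:
        cmd = e2image_command(device, backup_dir)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    snapshot_metadata(["/dev/hdc1"], "/backup/metadata")
```

The default dry_run=True only prints the commands, so the device list can be checked before anything runs as root.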

_______________________________________________

Ext3-users@redhat.com
https://listman.redhat.com/mailman/listinfo/ext3-users