Re: Broken harddisk

Hi,

I'd like to say thanks to everyone who has replied so far! :-)

Gordon Henderson scribbled on 29.01.2005 13:46:
On Sat, 29 Jan 2005, T. Ermlich wrote:

Hello there,

I just got here from http://cgi.cse.unsw.edu.au/~neilb/Contact ...
Hopefully this is more or less the right place.

Several months ago I set up a RAID1 using mdadm.
Two drives (/dev/sda & /dev/sdb, each a 160GB Samsung SATA
disk) are used, and now provide /dev/md0, /dev/md1, /dev/md2 &
/dev/md3. In November 2004 I upgraded to mdadm 1.8.1.

Drop 1.8.1 and get 1.8.0. I understand 1.8.1 contains some experimental code and is not designed to be used for real.

This afternoon, about 9 hours ago, /dev/sda broke down ... no chance to
get it working again .. :(

My question now is: what do I have to do now?

Well, go through the procedure to remove the disk and put a new one back in...

OK ... as the broken disk stopped the system, and the system hung during the boot procedure, I had to remove it (disconnected the cables).

The system is up and running, so I'd do a current backup of the most
important data ... but how do I 'replace' the broken drive and 'restore'
the data content there (sorry, as English is not my native language I
have no idea how to explain it correctly)?
Is there a way to do so, or do I have to create a RAID1 from scratch
and copy all data from /dev/md0-3 there manually?

You should not have to copy it - that's the whole point of it all. However, RAID is not a substitute for proper backups, so make sure you do those backups now and regularly in the future.

Backups are done every night (3 a.m.), so I just made a backup of the latest changes (between ~3 a.m. and 3:30 p.m.).

OK - here are the basic steps - you may have to modify them, as you haven't
posted enough detail for me to work them out for your exact system.

I'm assuming that you have partitioned each disk with 4 partitions, both
disks are partitioned identically, and you are combining the same partition
of each device into the md devices. (eg. /dev/md0 is made from /dev/sda1
and /dev/sdb1) This is reasonably "sane" and I'm sure lots of people do it
this way (I do, but I'm a small sample :) If you aren't doing it this way,
then this won't work for you, but you may be able to adapt it for your
needs.

That's right: each harddisk is partitioned absolutely identically, like:

      0 - 19456 - /dev/sda1 - extended partition
      1 -  6528 - /dev/sda5 - /dev/md0
   6529 -  9138 - /dev/sda6 - /dev/md1
   9139 - 16970 - /dev/sda7 - /dev/md2
  16971 - 19456 - /dev/sda8 - /dev/md3

And after doing that partitioning I 'combined' them to act as RAID1.

Firstly, get mdadm 1.8.0 as I mentioned above.

Look at /proc/mdstat.

See if all 4 md devices have a failed device in them. If the disk is really
dead, this is likely to be the case; if it's not, then you'll need to fail
each partition in each md device:

To make sure that each md device has the failed disk really failed,
you can do:

  mdadm --fail /dev/md0 /dev/sda1
  mdadm --fail /dev/md1 /dev/sda2
  mdadm --fail /dev/md2 /dev/sda3
  mdadm --fail /dev/md3 /dev/sda4

Next, you need to remove the failed disk from each array

  mdadm --remove /dev/md0 /dev/sda1
  mdadm --remove /dev/md1 /dev/sda2
  mdadm --remove /dev/md2 /dev/sda3
  mdadm --remove /dev/md3 /dev/sda4
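The four fail commands and four remove commands above follow one pattern, so they can be generated with a small loop. Note that Gordon's examples assume components sda1..sda4; with the sda5..sda8 layout Torsten described, the mapping would be md0->sda5 through md3->sda8. A minimal dry-run sketch - it only echoes the commands, so they can be reviewed before being piped to a shell:

```shell
# Dry-run sketch: generate the fail + remove commands for each array.
# ASSUMPTION: md0..md3 are built from sda5..sda8 (Torsten's layout);
# adjust the starting partition number for other layouts.
part=5
for md in 0 1 2 3; do
    echo mdadm --fail   "/dev/md$md" "/dev/sda$part"
    echo mdadm --remove "/dev/md$md" "/dev/sda$part"
    part=$((part + 1))
done
```

Once the echoed commands look right, pipe the output to `sh` to actually run them.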

Strictly speaking, you don't have to do this - you can just power down and
put a new disk in, but I feel this is "cleaner" and hopefully leaves the
system in a stable and known state when you do power down.

Haven't done that, because the system was already down ...

At this point you can power down the machine and physically remove the
drive and replace it with a new, identical unit.

So I did: replaced the broken one (Samsung SP1614C) with an identical drive.

Reboot your PC. If it would normally boot off sda, you have to persuade it
to boot off sdb. You might need to alter the BIOS to do this, or maybe
not... All BIOSes and controllers have their own little ideas about how
this is done.

If it boots off another drive (eg. an IDE drive) then you should be fine.
If it does boot off sda, then I hope you used the raid-extra-boot option
in lilo.conf (and tested it...) If you are using grub, I can't be of any
assistance there as I don't use it.

I have two additional IDE drives in that system.
/dev/hda contains some data, and is the boot drive, /dev/hdb contains some less important data.


You should now have the system running with the data intact on sdb and all
the md devices working and mounted as normal.

Now you have to re-partition the new sda identical to sdb. If they are the
same make and size, you can use this:

sfdisk -d /dev/sdb | sfdisk /dev/sda

This didn't work properly, so I partitioned the new drive manually.
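When the sfdisk one-liner misbehaves and the new disk has to be partitioned by hand, it is worth double-checking that the two tables really ended up identical before re-adding. A minimal sketch of that comparison; the dump lines here are illustrative samples, on the live system they would come from `sfdisk -d /dev/sda` and `sfdisk -d /dev/sdb`:

```shell
# Sketch: compare two sfdisk-style dumps, ignoring the device name.
# The sample lines below are made up, not Torsten's real table.
sda_dump='/dev/sda5 : start=       63, size=104856129, Id=fd'
sdb_dump='/dev/sdb5 : start=       63, size=104856129, Id=fd'

norm() { sed 's#/dev/sd[ab]#/dev/sdX#'; }   # mask the drive letter

if [ "$(echo "$sda_dump" | norm)" = "$(echo "$sdb_dump" | norm)" ]; then
    echo "partition tables match"
else
    echo "partition tables DIFFER - fix before re-adding"
fi
```

With identical start/size/Id values, as above, this prints "partition tables match".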

Now, tell the raid code to re-mirror the drives:

  mdadm --add /dev/md0 /dev/sda1
  mdadm --add /dev/md1 /dev/sda2
  mdadm --add /dev/md2 /dev/sda3
  mdadm --add /dev/md3 /dev/sda4

Now some new trouble starts ...
'mdadm --add /dev/md0 /dev/sda1' started just fine - but at exactly 50% it started giving tons of errors, like:
[quote]
Jan 29 16:10:24 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
Jan 29 16:10:24 suse92 kernel: end_request: I/O error, dev sdb, sector 52460420
Jan 29 16:10:25 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 29 16:10:25 suse92 kernel: ata2: error=0x40 { UncorrectableError }
Jan 29 16:10:25 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b 85 00 02 f9 00
Jan 29 16:10:25 suse92 kernel: Current sdb: sense key Medium Error
Jan 29 16:10:25 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
Jan 29 16:10:25 suse92 kernel: end_request: I/O error, dev sdb, sector 52460421
Jan 29 16:10:26 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 29 16:10:26 suse92 kernel: ata2: error=0x40 { UncorrectableError }
Jan 29 16:10:26 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b 86 00 02 f8 00
Jan 29 16:10:26 suse92 kernel: Current sdb: sense key Medium Error
Jan 29 16:10:26 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
Jan 29 16:10:26 suse92 kernel: end_request: I/O error, dev sdb, sector 52460422
Jan 29 16:10:27 suse92 kernel: ata2: status=0x51 { DriveReady SeekComplete Error }
Jan 29 16:10:27 suse92 kernel: ata2: error=0x40 { UncorrectableError }
Jan 29 16:10:27 suse92 kernel: scsi1: ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 03 20 7b 87 00 02 f7 00
Jan 29 16:10:27 suse92 kernel: Current sdb: sense key Medium Error
Jan 29 16:10:27 suse92 kernel: Additional sense: Unrecovered read error - auto reallocate failed
Jan 29 16:10:27 suse92 kernel: end_request: I/O error, dev sdb, sector 52460423
[/quote]
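Note that these read errors are on sdb - the surviving disk - so the second drive appears to be developing bad sectors as well. Given a partition table in sectors (from `fdisk -lu` or `sfdisk -d`), a kernel-reported sector number can be mapped to the partition, and thus the md device, it falls in. A sketch with illustrative start/size values, not Torsten's real table:

```shell
# Sketch: map a kernel-reported bad sector to the partition containing it.
# ASSUMPTION: the start/size values below are made-up examples; take real
# ones from 'fdisk -lu /dev/sdb' (units: 512-byte sectors).
bad=52460420
table='sdb5 63 104856129
sdb6 104856193 41929649
sdb7 146785843 125820865
sdb8 272606709 39921280'
echo "$table" | awk -v s="$bad" \
    '$2 <= s+0 && s+0 < $2+$3 {print "sector " s " falls inside /dev/" $1}'
```

With these sample numbers the sector lands in the first data partition, i.e. the md0 component - consistent with the resync failing at the 50% mark.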


then run:

   watch -n1 cat /proc/mdstat

and wait for it to finish; the system remains fully usable throughout
this process.

[quote]
Every 1,0s: cat /proc/mdstat Sat Jan 29 16:08:50 2005


Personalities : [raid1]
md3 : active raid1 sdb8[1]
      19960640 blocks [2/1] [_U]

md2 : active raid1 sdb7[1]
      62910400 blocks [2/1] [_U]

md1 : active raid1 sdb6[1]
      20964672 blocks [2/1] [_U]

md0 : active raid1 sdb5[1] sda5[2]
      52436032 blocks [2/1] [_U]
      [==========>..........]  recovery = 50.0% (26230016/52436032) finish=121.7min speed=1050K/sec

unused devices: <none>
[/quote]
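For scripting, the recovery percentage in output like the above can be extracted with awk instead of eyeballing watch. A small sketch, run here against a captured sample line; on a live system the input would be /proc/mdstat itself:

```shell
# Sketch: pull the recovery percentage out of an mdstat progress line.
# 'sample' is a captured example; live usage: awk '...' /proc/mdstat
sample='[==========>..........]  recovery = 50.0% (26230016/52436032) finish=121.7min speed=1050K/sec'
echo "$sample" | awk '{ for (i = 1; i <= NF; i++)
                            if ($i == "recovery") print $(i + 2) }'
# prints: 50.0%
```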


Can I stop that process for /dev/md0 and start with /dev/md1 instead (just to compare whether it's a problem with that partition only, or a general problem, i.e. the second drive has problems too)?

btw: does mdadm also format the partitions?

If you can't power the machine down, and have hot-swappable drives in
proper caddies, then there is a way to tell the kernel that you are
removing the drive and adding a new one, however it's probably safer if
you can do it while powered down.

If this doesn't make sense, post back the output of /proc/mdstat and
fdisk -l
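The two outputs Gordon asks for can be bundled into a single file for posting back to the list. A minimal sketch; the `section` helper and the `raid-report.txt` filename are just illustrative choices:

```shell
# Sketch: collect labelled diagnostics into one file to post to the list.
section() { echo "--- $* ---"; "$@" 2>&1 || true; }   # label + run, never abort

{
    section cat /proc/mdstat
    section fdisk -l
} > raid-report.txt
echo "wrote raid-report.txt"
```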

Good luck!

Gordon

Have a nice day

Torsten

-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
