Re: Weird: Degrading while recovering raid5

Hi Kyle,

Other people will jump in and help you with your problem, but I'll add a couple of pointers while you are waiting. See below.

On 10/02/15 15:20, Kyle Logue wrote:
Hey all:

I have a 5 disk software raid5 that was working fine until I decided
to swap out an old disk with a new one.

mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md0 --fail /dev/sde1

At this point it started automatically rebuilding the array.
About 60%? of the way in it stops and I see a lot of this repeated in my dmesg:

[Mon Feb  9 18:06:48 2015] ata5.00: exception Emask 0x0 SAct 0x0 SErr
0x0 action 0x6 frozen
[Mon Feb  9 18:06:48 2015] ata5.00: failed command: SMART
[Mon Feb  9 18:06:48 2015] ata5.00: cmd
b0/da:00:00:4f:c2/00:00:00:00:00/00 tag 7
[Mon Feb  9 18:06:48 2015]          res
40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
[Mon Feb  9 18:06:48 2015] ata5.00: status: { DRDY }
[Mon Feb  9 18:06:48 2015] ata5: hard resetting link
[Mon Feb  9 18:06:58 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb  9 18:06:58 2015] ata5: hard resetting link
[Mon Feb  9 18:07:08 2015] ata5: softreset failed (1st FIS failed)
[Mon Feb  9 18:07:08 2015] ata5: hard resetting link
[Mon Feb  9 18:07:12 2015] ata5: SATA link up 1.5 Gbps (SStatus 113
SControl 310)
[Mon Feb  9 18:07:12 2015] ata5.00: configured for UDMA/33
[Mon Feb  9 18:07:12 2015] ata5: EH complete

ata5 corresponds to my /dev/sdc drive.

First, check whether the drive is faulty:
dd if=/dev/sdc of=/dev/null bs=10M

If that completes without any errors from dd, then the drive can be read OK. Now check the logs: were there any errors there? Whether there were errors or not, read about timeout mismatches between the kernel and the hard drive, and how to solve them. There was another post earlier today with links to specific posts that will be helpful (check the online archive).
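As a rough sketch (assuming the drive supports SCT ERC and you have smartmontools installed; adjust the device name to suit), you can compare the drive's error recovery timeout against the kernel's command timeout like this:

smartctl -l scterc /dev/sdc                # show the drive's error recovery timeout, if supported
smartctl -l scterc,70,70 /dev/sdc          # try setting read/write ERC to 7 seconds
cat /sys/block/sdc/device/timeout          # kernel SCSI command timeout, in seconds (default 30)
echo 180 > /sys/block/sdc/device/timeout   # if the drive doesn't support ERC, raise the kernel timeout instead

The idea is that the drive should give up on a bad sector before the kernel gives up on the drive; otherwise the kernel resets the link and md kicks the whole disk out, which is roughly what your dmesg above looks like.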

Finally, I think your first mistake was to fail the drive. You should have replaced it instead, which would have kept the array protected against a further drive failure during the rebuild.
See the second answer to this question:
http://unix.stackexchange.com/questions/74924/how-to-safely-replace-a-not-yet-failed-disk-in-a-linux-raid5-array
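
Something along these lines (a sketch only, reusing your device names from above; --replace needs mdadm 3.3 or newer and a reasonably recent kernel):

mdadm /dev/md0 --add /dev/sda1
mdadm /dev/md0 --replace /dev/sde1 --with /dev/sda1

With --replace the data is copied onto the new drive while the old one stays active, so the array keeps full redundancy for the whole rebuild; the old drive is only marked faulty once the copy has finished.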

Regards,
Adam

--
Adam Goryachev
Website Managers
www.websitemanagers.com.au