Re: Revive a dead md raid5 array

On 20-11-18 at 20:32, Wol's lists wrote:
On 19/11/2018 22:35, Jogchum Reitsma wrote:
Hi,

New to this list, I understand that I can post a problem with an md raid array here. If I'm wrong in that supposition, please let me know, and accept my apologies!

That's right. Here you get the experts, but they may take time responding :-(
No problem! :-)

That said, though I wouldn't call myself an expert, I know enough to think everything will be fine. How easy would it be for you to back up the disk if you recover it read-only?
I'm not sure I understand what you mean by "back up the disk" - I did a ddrescue from the faulty disks to new WD 4TB disks, this time from the Red series, which do support SCT/ERC. Isn't that just what you mean by "backup"?

I have a 4-disk raid5 array, of which two of the disks were kicked out because of read errors. The disks are WD Blue 4TB disks, which are still under warranty.

Just looked at the spec of those drives. It looks a bit worrisome to me. Take a look at the raid wiki: https://raid.wiki.kernel.org/index.php/Linux_Raid
I already read that, and also (among others) the link https://raid.wiki.kernel.org/index.php/Timeout_Mismatch mentioned there. I'm pretty sure I read somewhere that for software raid NAS disks should *not* be used, so when I created the array I bought WD Blue disks. But, having read the info in the links mentioned, I have now bought 2 WD Red disks, which indeed support TLER (as WD calls it).

What I would really like to see is whether these drives support SCT/ERC. If they don't there is our first problem. I notice WD says raids 0 and 1, not 5 ...
See my answer above. And there's no problem buying another two WD Red disks, to copy the contents of the other two disks to.
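
As an aside, whether a drive supports SCT/ERC (and enabling it where it does) can be checked with smartctl; a minimal sketch, with /dev/sda standing in for whichever member disk you want to query:

    # Report the drive's current SCT ERC read/write timeouts (or "not supported")
    smartctl -l scterc /dev/sda

    # If supported, set both read and write error recovery to 7.0 seconds (70 deciseconds)
    smartctl -l scterc,70,70 /dev/sda

The setting is usually lost on a power cycle, so it is typically re-applied from a boot script or udev rule.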

I have reasonably recent backups, but I would still like to try to get the array alive again.

That shouldn't be hard.

Funny thing is, mdadm --examine reports the array as being raid0:

    /dev/md0:
                Version : 1.2
             Raid Level : raid0
          Total Devices : 4
            Persistence : Superblock is persistent

This seems to be a recent glitch in mdadm. Don't worry about it ...
OK, thanks!



=================================================================

Maybe you noticed that all disks are marked as spare, and that the event count of one of the disks, /dev/sdf, is different from the others'?

I found some more occurrences of a raid5 being recognized as a raid0 device, but not a real solution to this.

The solution, iirc, was just to stop the array and re-assemble it - as soon as the array was running, it sorted itself out.
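
A minimal sketch of that stop/re-assemble cycle, assuming the array is /dev/md0 and sdw1, sdx1, sdy1 and sdz1 stand in for the real member partitions:

    # Stop the half-assembled (all-spare, "raid0") array first
    mdadm --stop /dev/md0

    # Re-assemble from the members; mdadm reads the superblocks to rebuild the config
    mdadm --assemble /dev/md0 /dev/sdw1 /dev/sdx1 /dev/sdy1 /dev/sdz1

    # Check that the level is reported as raid5 again
    cat /proc/mdstat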


=============================================================================================

The faulty disks are /dev/sda and /dev/sdd, and I copied the contents to new WD Red 4TB disks, with

ddrescue -d -s <size-of-target-disk> -f /dev/sd<source> /dev/sd<target> sd<source>.map

The size argument is needed because the new disks are some 4MB smaller than the originals.

ddrescue saw 14 read errors on one disk and 54 on the other, and copied 99.99% of the source.
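
For what it's worth, the ddrescue map file records exactly which areas could not be copied, and ddrescuelog (shipped with GNU ddrescue) can summarise it; a minimal sketch, with sda.map as a stand-in for the real map file:

    # Overall status: how much was rescued, how much is still marked bad
    ddrescuelog -t sda.map

    # List the block ranges still marked as bad sectors ('-')
    ddrescuelog --list-blocks=- sda.map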

Is there a way to revive the array, and if yes, how can I do that?

Firstly, if the blues don't support SCT/ERC, you NEED NEED NEED to fix the timeout mismatch problem. I suspect that's what blew up your array.
Or recreate the array, with WD Red disks? Funny thing is, the disk with a mismatch in event count was *not* kicked out of the array...
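
For reference, the workaround on the Timeout_Mismatch wiki page for drives without SCT/ERC is to raise the kernel's SCSI command timer on every array member, so the drive gets time to finish its own error recovery; a minimal sketch, with sda as a stand-in (not persistent across reboots):

    # Default is 30 seconds; raise it well above the drive's worst-case recovery time
    echo 180 > /sys/block/sda/device/timeout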

Now because you've got a bunch of read errors, I suspect you're going to lose some data, sorry. You have two choices.

1) Force assemble the array with all four drives, and run a repair. This should fix your read errors, but risks losing data thanks to the event counter mismatch.
Excuse my ignorance here, but what do you mean by "repair"? Run a fsck?
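
In md terms a "repair" most likely means a scrub via sync_action rather than an fsck; a minimal sketch, assuming the array is /dev/md0:

    # Read every stripe, rewrite parity/data where they disagree or a sector is unreadable
    echo repair > /sys/block/md0/md/sync_action

    # Watch progress, then see how many mismatches were found
    cat /proc/mdstat
    cat /sys/block/md0/md/mismatch_cnt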

2) Force assemble the array with the three good drives, then re-add the fourth. If you've got bitmaps, and it re-adds cleanly, then run a repair and you *might* get everything back. Otherwise it might just add, so it will re-sync of its own accord, which will give you a clean array with no errors but the read errors will cause data loss. Sorry.
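
A minimal sketch of option 2, with sdw1/sdx1/sdy1 standing in for the three good members and sdz1 for the out-of-date one (all names hypothetical):

    # Stop whatever half-assembled state is left, then force-assemble the three good members
    mdadm --stop /dev/md0
    mdadm --assemble --force /dev/md0 /dev/sdw1 /dev/sdx1 /dev/sdy1

    # Re-add the fourth; with a write-intent bitmap this may come back almost instantly,
    # otherwise it falls back to a full resync
    mdadm /dev/md0 --re-add /dev/sdz1

    cat /proc/mdstat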

I've got to learn a lot - what do you mean by "If you've got bitmaps"?
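
"Bitmaps" here presumably refers to md's write-intent bitmap, which lets a briefly-missing member be re-added by resyncing only the dirty regions; a minimal sketch of checking for one, with sdw1 as a stand-in member:

    # The superblock will show an "Internal Bitmap" line if one exists
    mdadm --examine /dev/sdw1 | grep -i bitmap

    # Or, for a running array
    grep -i bitmap /proc/mdstat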

As said, I made copies of the disks with read errors, using ddrescue. AFAICS ddrescue managed to overcome some, though not all, of the read errors, so I expect the new disks to be better in that respect than the originals.
These new disks support TLER, so that's also an improvement.
Wouldn't it be better, with that in mind, to replace the disks with read errors with the new ones, and revive the array with those?



Note that before you do any re-assembly, you need to do an "array stop", otherwise pretty much anything you try will fail with "device busy".

Cheers,
Wol

Thanks a lot!

Cheers, Jogchum



