On 07/09/18 06:14, Francois Goudal wrote:
Hello,
I've been running a 5-disk RAID5 volume with an ext4 filesystem on a
Synology NAS since 2012 without any problems until last week, when bad
things happened.
At first, my disk in slot 5 "failed". I'm putting quotation marks here
because, as I'll explain below, I later found out that the disk is
actually in good shape, so it might have been a controller issue, who
knows...
At this point, the array is degraded but still fully working. I don't
do anything other than order another disk as a replacement.
A couple of days later, the new disk gets delivered. I remove the failed
disk from slot 5, put in the new disk and initiate the resync of the
volume. Of course, halfway through, what had to happen happened: I got a
URE (unrecoverable read error) on the disk in slot 1. That disk is marked
failed, and the volume fails too as a consequence of 2 disks missing.
Now, it's time to think about recovery, because I unfortunately do not
have a very recent backup of the data (lesson learned, won't do this
ever again).
At this point, I decide to freeze everything before trying anything
stupid.
I took all 5 original disks from the NAS out and connected them to a
linux machine and went through a very lengthy process of running
ddrescue to image them all.
- Slot 5 disk (the first one that failed) happens to read properly,
no errors at all...
- Slot 1 disk (the one that failed next with the URE) has 2 consecutive
sectors (1 KiB) at approx 60% of the volume that can't be read; all
other data reads fine
- Slots 2, 3 and 4 disks read fine
So, I now have full images of all 5 disks that I can safely work on. They
are on an LVM-based volume and I have a snapshot, so I can try and fail
with bad mdadm commands and easily go back to the original dumps.
My Events counter on disks looks like this:
root@lab:/# mdadm --examine /mnt/dump2/slot{1,2,3,4,5}.img | grep Event
Events : 2357031
Events : 2357038
Events : 2357041
Events : 2357044
Events : 2354905
Disk 5 is way behind, which is normal since the array was kept running
for a couple days after that disk failed.
Disks 1, 2, 3 and 4 are all pretty close. They do not have exactly the
same number, but I think this is because I didn't stop the raid volume
before pulling the disks out, so each time a disk was pulled, the
Array State in the superblock was updated on the remaining disks. My
mistake here, but hopefully not going to be a big deal?
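To put a number on "pretty close", here is a minimal sketch that
summarises the Events lines from mdadm --examine. The function name
events_spread is made up for illustration; it only parses text and does
not touch any device.

```bash
#!/usr/bin/env bash
# Summarise the "Events" lines of `mdadm --examine` output read from stdin.
# Prints three numbers: lowest count, highest count, and their difference.
events_spread() {
    awk '$1 == "Events" && $2 == ":" {
             v = $3 + 0
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n) printf "%d %d %d\n", min, max, max - min }'
}
```

Typical use would be
mdadm --examine /mnt/dump2/slot{1,2,3,4}.img | events_spread
which, with the counters listed above, gives a spread of 13 for slots
1-4, against 2139 once slot 5 is included.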
So, my conclusion at this point is that I probably still have a
consistent state with disks 1, 2, 3 and 4 (except for a known 1 KiB of
data that's corrupted, which shouldn't be a very big deal: those
sectors may not have been used by the filesystem at all, and even if
they were, this shouldn't prevent me from recovering most of my files,
as long as I can reassemble the volume somehow).
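Whether those two sectors hold data or parity, and where they land in
the array, can be worked out from the superblock values shown further
down (5 devices, left-symmetric layout, 64K chunks, data offset 2048
sectors). The following is only a sketch under those assumptions: the
function name member_to_array is invented, the defaults are taken from
the --examine output, and any layer between the md device and ext4
(LVM, for instance) would shift the result further.

```bash
#!/usr/bin/env bash
# Map a 512-byte sector of one RAID5 member to its position in the array,
# assuming the left-symmetric layout.
# Usage: member_to_array ROLE DEV_SECTOR [RAID_DEVICES] [CHUNK_SECTORS] [DATA_OFFSET]
#   ROLE        "Active device N" number from mdadm --examine (0-based)
#   DEV_SECTOR  sector counted from the start of the member image
member_to_array() {
    local role=$1 dev_sector=$2
    local n=${3:-5} chunk=${4:-128} data_offset=${5:-2048}
    local rel=$(( dev_sector - data_offset ))
    if (( rel < 0 )); then
        echo "before-data-area"
        return 1
    fi
    local stripe=$(( rel / chunk ))
    local within=$(( rel % chunk ))
    # Left-symmetric: parity starts on the last member and moves back one
    # member per stripe; data chunks start right after the parity member.
    local parity_role=$(( (n - 1) - stripe % n ))
    if (( role == parity_role )); then
        echo "parity stripe=$stripe"
        return 0
    fi
    local data_idx=$(( (role - parity_role - 1 + n) % n ))
    local array_sector=$(( (stripe * (n - 1) + data_idx) * chunk + within ))
    echo "data stripe=$stripe array_sector=$array_sector"
}
```

The DEV_SECTOR of the unreadable area would come from the ddrescue
mapfile for slot 1 (role 0). A "parity" answer would mean no file data
sits in the damaged sectors at all.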
I was thinking about trying something like mdadm --assemble
--assume-clean --level=5 --raid-devices=5 /dev/md0 /dev/loop0
/dev/loop1 /dev/loop2 /dev/loop3 missing
(with /dev/loop0-3 pointing to the images of disks 1-4 respectively, and
disk 5 declared as missing)
I haven't tried this yet. Would this be the right approach? Any other
suggestions are welcome.
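For comparison, a hedged sketch of a forced assembly from the images.
As far as the mdadm man page goes, --assume-clean, --level and
--raid-devices are --create options, while --assemble reads the geometry
from the existing superblocks and takes --force to accept members whose
event counts differ slightly. The helper names run and
assemble_from_images are made up, and DRY_RUN defaults to 1 so that
nothing is executed unless asked. It assumes the images sit on the
snapshotted volume, since a forced assembly may rewrite superblocks.

```bash
#!/usr/bin/env bash
# Print a command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# assemble_from_images MD_DEVICE IMAGE...
# Attach each image to a free loop device, then assemble the array
# read-only from whatever members were given (leave the stale disk out).
assemble_from_images() {
    local md=$1
    shift
    local img loopdev
    local -a loops=()
    for img in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "losetup --find --show $img"
            loopdev="LOOP($img)"    # placeholder, no device is attached
        else
            loopdev=$(losetup --find --show "$img") || return 1
        fi
        loops+=("$loopdev")
    done
    run mdadm --assemble --force --run --readonly "$md" "${loops[@]}"
}
```

Calling assemble_from_images /dev/md0 /mnt/dump2/slot{1,2,3,4}.img
prints the four losetup commands and the mdadm command; with DRY_RUN=0
and root privileges it would run them.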
Personally, I think this is the right "next step", but if you want to
recover 100% of your data, then I'd follow the process below. I don't
know all the precise magic commands, but perhaps more research, some
trial and error, or someone else jumping in will supply the details:
1) If you can identify the URE blocks, then you could use disks 2, 3, 4
and the original disk5 to recalculate the correct values for disk1, and
write them into the image copy (or write them to the original disk1,
which should either make the sectors readable again or remap them to
spare physical sectors, resolving the URE either way).
2) Then you will need to research the timeout issue, UREs and your
disks, and fix the timeout issue (assuming that is what caused the
original problem with disk5, and potentially the problem with disk1
during the rebuild).
3) Then you can re-add the new disk5, and allow the resync to complete.
4) If possible, wipe the original disk5 and add it to the array as a
spare or, even better, convert to RAID6
5) Enable regular checks of the array so that you detect UREs
before they become a problem (during a rebuild)
6) Enjoy many more years of trouble free operation
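Step 1 can be sketched as a plain XOR: in RAID5 any one chunk of a
stripe equals the XOR of the other four members at the same offset,
parity included. This is only an illustration. The function name
xor_region is invented, the byte offset of the bad area has to come
from the ddrescue mapfile, and it assumes the stripe in question was
not rewritten after disk5 dropped out (otherwise disk5's copy is stale
and the result is wrong).

```bash
#!/usr/bin/env bash
# xor_region OFFSET LENGTH OUT IMAGE...
# XOR LENGTH bytes starting at byte OFFSET of every IMAGE and write the
# result to OUT. Meant for small regions (here 1 KiB), not whole disks.
xor_region() {
    local offset=$1 length=$2 out=$3
    shift 3
    local -a acc=()
    local i img byte
    for ((i = 0; i < length; i++)); do acc[i]=0; done
    for img in "$@"; do
        i=0
        # od prints every byte as an unsigned decimal number
        for byte in $(dd if="$img" bs=1 skip="$offset" count="$length" 2>/dev/null |
                      od -An -v -tu1); do
            acc[i]=$(( acc[i] ^ byte ))
            i=$(( i + 1 ))
        done
        # Refuse to continue on a short read
        (( i == length )) || return 1
    done
    {
        for ((i = 0; i < length; i++)); do
            # Emit the byte through an octal escape so NUL bytes survive
            printf "\\$(printf '%03o' "${acc[i]}")"
        done
    } > "$out"
}
```

With the offset of the unreadable 1 KiB in slot1.img known, the call
would take the slot 2, 3, 4 and 5 images as inputs. The output file can
then be inspected before it is written into the slot 1 image copy, for
example with dd and conv=notrunc.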
Hope that helps, but it sounds like as far as data recovery goes, you
are in an excellent position to recover everything.
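For steps 2 and 5, a rough sketch of what is commonly done on a plain
Linux md setup: ask the drive to give up on a bad sector after 7
seconds (SCT ERC), fall back to a longer kernel command timeout when
the drive refuses, and trigger a scrub through sysfs. Whether any of
this applies unchanged on the Synology firmware is an assumption, the
function names are invented, and DRY_RUN defaults to 1 so the commands
are only printed.

```bash
#!/usr/bin/env bash
# set_drive_timeouts DEV   (DEV is a kernel name such as sdb)
# Try to enable a 7.0 s error-recovery limit on the drive; if the drive
# rejects it, raise the kernel's command timeout to 180 s instead.
# Assumes smartctl exits non-zero when SCT ERC is unsupported.
set_drive_timeouts() {
    local dev=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "smartctl -l scterc,70,70 /dev/$dev"
        echo "on failure: echo 180 > /sys/block/$dev/device/timeout"
        return 0
    fi
    if ! smartctl -l scterc,70,70 "/dev/$dev" > /dev/null; then
        echo 180 > "/sys/block/$dev/device/timeout"
    fi
}

# start_scrub MD   (MD is a kernel name such as md0)
# Ask md to read every stripe so latent UREs show up before a rebuild.
start_scrub() {
    local md=$1
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "echo check > /sys/block/$md/md/sync_action"
        return 0
    fi
    echo check > "/sys/block/$md/md/sync_action"
}
```

Run from cron or a timer once a month or so, start_scrub md0 would
cover step 5; progress shows up in /proc/mdstat and the result in
/sys/block/md0/md/mismatch_cnt.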
Regards,
Adam
Thanks in advance.
Pasting below the output of some commands:
root@lab:/# mdadm --examine /mnt/dump2/*.img
/mnt/dump2/slot1.img:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 76ec0964:7491b265:25110f4d:81d88cc3
Name : NAS:2
Creation Time : Sat Jan 14 16:49:14 2012
Raid Level : raid5
Raid Devices : 5
Avail Dev Size : 1944080833 (927.01 GiB 995.37 GB)
Array Size : 7776322048 (3708.04 GiB 3981.48 GB)
Used Dev Size : 1944080512 (927.01 GiB 995.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 89299ff0:6fa8ac04:0beea54f:bc0674c8
Update Time : Tue Aug 28 22:16:42 2018
Checksum : 5d04dd8d - correct
Events : 2357031
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 0
Array State : AAAAA ('A' == active, '.' == missing)
/mnt/dump2/slot2.img:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 76ec0964:7491b265:25110f4d:81d88cc3
Name : NAS:2
Creation Time : Sat Jan 14 16:49:14 2012
Raid Level : raid5
Raid Devices : 5
Avail Dev Size : 1944080833 (927.01 GiB 995.37 GB)
Array Size : 7776322048 (3708.04 GiB 3981.48 GB)
Used Dev Size : 1944080512 (927.01 GiB 995.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 6d56ce4e:f49d35da:96069592:056b4055
Update Time : Tue Aug 28 22:22:19 2018
Checksum : 60737dfa - correct
Events : 2357038
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 1
Array State : .AAA. ('A' == active, '.' == missing)
/mnt/dump2/slot3.img:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 76ec0964:7491b265:25110f4d:81d88cc3
Name : NAS:2
Creation Time : Sat Jan 14 16:49:14 2012
Raid Level : raid5
Raid Devices : 5
Avail Dev Size : 1944080833 (927.01 GiB 995.37 GB)
Array Size : 7776322048 (3708.04 GiB 3981.48 GB)
Used Dev Size : 1944080512 (927.01 GiB 995.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 3a8a9c7d:8711e931:3b64eee5:fd9461c9
Update Time : Sat Sep 1 21:56:49 2018
Checksum : ae71ed02 - correct
Events : 2357041
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 2
Array State : ..AA. ('A' == active, '.' == missing)
/mnt/dump2/slot4.img:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 76ec0964:7491b265:25110f4d:81d88cc3
Name : NAS:2
Creation Time : Sat Jan 14 16:49:14 2012
Raid Level : raid5
Raid Devices : 5
Avail Dev Size : 1944080833 (927.01 GiB 995.37 GB)
Array Size : 7776322048 (3708.04 GiB 3981.48 GB)
Used Dev Size : 1944080512 (927.01 GiB 995.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : 31f3790a:85db548a:a84d2754:c75854e8
Update Time : Sun Sep 2 06:38:53 2018
Checksum : 20e8478a - correct
Events : 2357044
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 3
Array State : ...A. ('A' == active, '.' == missing)
/mnt/dump2/slot5.img:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x0
Array UUID : 76ec0964:7491b265:25110f4d:81d88cc3
Name : NAS:2
Creation Time : Sat Jan 14 16:49:14 2012
Raid Level : raid5
Raid Devices : 5
Avail Dev Size : 1944080833 (927.01 GiB 995.37 GB)
Array Size : 7776322048 (3708.04 GiB 3981.48 GB)
Used Dev Size : 1944080512 (927.01 GiB 995.37 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
State : clean
Device UUID : e046df58:28bb1715:160ed2d5:6e2aae94
Update Time : Fri Aug 24 22:00:11 2018
Checksum : 2810ff0a - correct
Events : 2354905
Layout : left-symmetric
Chunk Size : 64K
Device Role : Active device 4
Array State : AAAAA ('A' == active, '.' == missing)
--
Adam Goryachev Website Managers www.websitemanagers.com.au