Re: Need help to recover root filesystem after a power supply issue

Andrey Zhunev <a-j@xxxxxx> · Wed, 10 Jul 2019 18:02:30 +0300

Wednesday, July 10, 2019, 5:23:41 PM, you wrote:

> On 7/10/19 8:58 AM, Andrey Zhunev wrote:
>> Wednesday, July 10, 2019, 4:26:14 PM, you wrote:
>> 
>>> On 7/10/19 4:56 AM, Andrey Zhunev wrote:
>>>> Hello All,
>>>>
>>>> I am struggling to recover my system after a PSU failure, and I was
>>>> suggested to ask here for support.
>>>>
>>>> One of the hard drives throws some read errors, and that happen to be
>>>> my root drive...
>>>> My system is CentOS 7, and the root partition is a part of LVM.
>>>>
>>>> [root@mgmt ~]# lvscan
>>>>   ACTIVE            '/dev/centos/root' [<98.83 GiB] inherit
>>>>   ACTIVE            '/dev/centos/home' [<638.31 GiB] inherit
>>>>   ACTIVE            '/dev/centos/swap' [<7.52 GiB] inherit
>>>> [root@mgmt ~]#
>>>>
>>>> [root@tftp ~]# file -s /dev/centos/root
>>>> /dev/centos/root: symbolic link to `../dm-3'
>>>> [root@tftp ~]# file -s /dev/centos/home
>>>> /dev/centos/home: symbolic link to `../dm-4'
>>>> [root@tftp ~]# file -s /dev/dm-3
>>>> /dev/dm-3: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>>> [root@tftp ~]# file -s /dev/dm-4
>>>> /dev/dm-4: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
>>>>
>>>>
>>>> [root@tftp ~]# xfs_repair /dev/centos/root
>>>> Phase 1 - find and verify superblock...
>>>> superblock read failed, offset 53057945600, size 131072, ag 2, rval -1
>>>>
>>>> fatal error -- Input/output error
>> 
>>> look at dmesg, see what the kernel says about the read failure.
>> 
>>> You might be able to use https://www.gnu.org/software/ddrescue/ 
>>> to read as many sectors off the device into an image file as possible,
>>> and that image might be enough to work with for recovery.  That would be
>>> my first approach:
>> 
>>> 1) use dd-rescue to create an image file of the device
>>> 2) make a copy of that image file
>>> 3) run xfs_repair -n on the copy to see what it would do
>>> 4) if that looks reasonable run xfs_repair on the copy
>>> 5) mount the copy and see what you get
>> 
>>> But if your drive simply cannot be read at all, this is not a filesystem
>>> problem, it is a hardware problem. If this is critical data you may wish
>>> to hire a data recovery service.
>> 
>>> -Eric
>> 
>> 
>> Hi Eric,
>> 
>> Thanks for your message!
>> I already started to copy the failing drive with ddrescue. This is a
>> large drive, so it takes some time to complete...
>> 
>> When I tried to run xfs_repair on the original (failing) drive, the
>> xfs_repair was unable to read the superblock and then just quitted
>> with an 'io error'.
>> Do you think it can behave differently on a copied image ?

> As I said, look at dmesg to see what failed on the original drive read
> attempt.

> ddrescue will fill unreadable sectors with 0, and then of course that
> can be read from the image file.

Ooops, I forgot to paste the error message from dmesg.
Here it is:

Jul 10 11:48:05 mgmt kernel: ata1.00: exception Emask 0x0 SAct 0x180000 SErr 0x0 action 0x0
Jul 10 11:48:05 mgmt kernel: ata1.00: irq_stat 0x40000008
Jul 10 11:48:05 mgmt kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 10 11:48:05 mgmt kernel: ata1.00: cmd 60/00:98:28:ac:3e/01:00:03:00:00/40 tag 19 ncq 131072 in#012         res 41/40:00:08:ad:3e/00:00:03:00:00/40 Emask 0x409 (media error) <F>
Jul 10 11:48:05 mgmt kernel: ata1.00: status: { DRDY ERR }
Jul 10 11:48:05 mgmt kernel: ata1.00: error: { UNC }
Jul 10 11:48:05 mgmt kernel: ata1.00: configured for UDMA/133
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Sense Key : Medium Error [current] [descriptor]
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto reallocate failed
Jul 10 11:48:05 mgmt kernel: sd 0:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 03 3e ac 28 00 00 01 00 00 00
Jul 10 11:48:05 mgmt kernel: blk_update_request: I/O error, dev sda, sector 54439176
Jul 10 11:48:05 mgmt kernel: ata1: EH complete

There are several of these.
At the moment ddrescue reports 22 read errors (with 35% of the data
copied to a new storage). If I remember correctly, the LVM with my
root partition is at the end of the drive. This means more errors will
likely come... :( 

The way I interpret the dmesg message, that's just a read error. I'm
not sure, but maybe a complete wipe of the drive will even overwrite /
clear these unreadable sectors.
Well, that's something to be checked after the copy process finishes.

---
Best regards,
 Andrey