On Wed, Jul 10, 2019 at 10:08 AM Andrey Zhunev <a-j@xxxxxx> wrote:
>
>
> Wednesday, July 10, 2019, 6:45:28 PM, you wrote:
>
> > On Wed, Jul 10, 2019 at 9:29 AM Andrey Zhunev <a-j@xxxxxx> wrote:
> >>
> >> Well, this machine is always online (24/7, with a UPS backup power).
> >> Yesterday we found it switched OFF, without any signs of life. Trying
> >> to switch it on, the PSU made a humming noise and the machine didn't
> >> even try to start. So we replaced the PSU. After that, the machine
> >> powered on - but refused to boot... Something tells me these two
> >> failures are likely related...
>
> > Most likely the drive is dying and the spin down from power failure
> > and subsequent spin up has increased the rate of degradation, and
> > that's why they seem related.
>
> > What do you get for:
>
> > # smartctl -x /dev/sda
>
>
> The '-x' option gives a lot of output.
> It's pasted here: https://pastebin.com/raw/yW3yDuSF

197 Current_Pending_Sector  -O--CK   200   200   000    -    68

> Well, if there is evidence the drive is really dying - so be it...
> I just need to recover the data, if possible.
> On the other hand, if the drive will work further - I will find some
> unimportant files to store...

I think 68 pending sectors is excessive and I'd plan to have the drive
replaced under warranty, or demote it to something you don't care
about. Chances are this is going to get worse. I don't know how many
reserve sectors drives have, I don't even have a guess. But I have seen
drives run out of reserve sectors, at which point you start to see
write failures because LBAs can't be remapped from a bad sector that
fails writes to a good one. At that point, the drive is untenable.

Anyway, it's a bit tedious to fix 68 sectors manually, so if you have
the time to just wait for it, try this:

# smartctl -l scterc,900,100 /dev/sda
# echo 180 > /sys/block/sda/device/timeout

And now try to fsck.
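A note on the two numbers in those commands, since they use different
units: the scterc values are in tenths of a second (900 = 90 seconds for
reads, 100 = 10 seconds for writes), while the kernel command timer in
/sys/block/sda/device/timeout is in whole seconds. The point of setting
180 is that the kernel should wait longer than the drive's own recovery
limit, otherwise it may reset the link while the drive is still trying.
Here is a rough sketch of that sanity check; the function names
erc_to_seconds and timeouts_ok are made up for illustration and are not
part of smartctl or the kernel:

```shell
#!/bin/sh
# Hypothetical helpers, only to illustrate the unit arithmetic.

# Convert an SCT ERC value (deciseconds, as given to
# "smartctl -l scterc,READ,WRITE") to whole seconds.
erc_to_seconds() {
    echo $(( $1 / 10 ))
}

# Succeed only if the kernel command timer (seconds) is strictly longer
# than the drive's read recovery limit (deciseconds).
timeouts_ok() {
    erc_ds=$1
    kernel_s=$2
    [ "$kernel_s" -gt "$(erc_to_seconds "$erc_ds")" ]
}

# Values taken from the commands above: 900 ds read limit, 180 s timer.
if timeouts_ok 900 180; then
    echo "ok: kernel timer 180s > drive read limit $(erc_to_seconds 900)s"
else
    echo "warning: kernel may reset the drive mid-recovery"
fi
```

With the values from this thread it takes the "ok" branch. It does not
touch the drive at all, it only checks the numbers.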
If the fsck fails with I/O errors very quickly, as in less than 90
seconds (the 900 deciseconds set with scterc), then that means the
drive firmware has concluded deep recovery won't matter and is pretty
much immediately giving up. At that point, those sectors are lost. You
could overwrite those sectors one by one with zeros and maybe an
xfs_repair will have enough information to reconstruct and repair
things well enough to copy data off. But you'll have to be suspicious
of every file, as any one of them could have been silently corrupted -
either by bad ECC reconstruction in the drive firmware or by
overwriting with zeros. I'd say there's a decent chance of recovery but
it will be tedious.

If it seems like the system is hanging without errors, that's actually
a good sign deep recovery is working. But like I said, it could take
hours. And then in the end it might still find a totally unrecoverable
sector.

-- 
Chris Murphy