Re: Need help to recover root filesystem after a power supply issue

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Wed, 10 Jul 2019 12:03:11 -0600

On Wed, Jul 10, 2019 at 11:16 AM Andrey Zhunev <a-j@xxxxxx> wrote:
>
> Wednesday, July 10, 2019, 7:47:55 PM, you wrote:
>
> > On Wed, Jul 10, 2019 at 10:46 AM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> # smartctl -l scterc,900,100
> >> # echo 180 > /sys/block/sda/device/timeout
>
>
> > smartctl command above does need a drive specified...
>
> Indeed! :)
>
> With the commands above, you are increasing the timeout and then fsck
> will try to re-read the sectors, right?

More correctly, the drive firmware won't time out as quickly, and will
try longer to recover the data *if* the sectors are marginally bad. If
the sectors are flat out bad, then the firmware will still (almost)
immediately give up, and at that point nothing else can be done except
zero the bad sectors and hope fsck can reconstruct what's missing.
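
For reference, here is a rough sketch of how the two settings relate,
with the drive actually specified this time. It only prints the
commands, it doesn't run them. The variable and function names are made
up for illustration, and sda is just the device from the example above.
smartctl takes the SCT ERC limits in tenths of a second, so 900,100
means 90 s for reads and 10 s for writes; the kernel command timeout is
in seconds and should be longer than the drive's own limit so the drive
gets to report the error itself.

```shell
#!/bin/sh
# Sketch only: prints the commands instead of running them.
# DEV is an assumption; substitute the failing drive.
DEV="${DEV:-sda}"

# smartctl expects SCT ERC limits in tenths of a second.
erc_units() {
    echo $(( $1 * 10 ))
}

# True when the kernel timeout (seconds) is longer than the drive's
# read recovery limit (seconds).
kernel_timeout_ok() {
    # $1 = kernel timeout in seconds, $2 = ERC read limit in seconds
    [ "$1" -gt "$2" ]
}

READ_S=90      # read recovery limit, seconds
WRITE_S=10     # write recovery limit, seconds
KERNEL_S=180   # kernel command timeout, seconds

if kernel_timeout_ok "$KERNEL_S" "$READ_S"; then
    echo "smartctl -l scterc,$(erc_units "$READ_S"),$(erc_units "$WRITE_S") /dev/$DEV"
    echo "echo $KERNEL_S > /sys/block/$DEV/device/timeout"
fi
```

Both commands need root, and on many drives the SCT ERC setting does
not survive a power cycle, so it has to be reapplied after a reboot.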

Thing is, 68 bad sectors have a low likelihood of impacting fs metadata,
because metadata is a smaller target than your actual data, or free
space if there's a lot of it.

> As for the SMART status, the number of pending sectors was 0 before.
> It started to grow after the PSU incident yesterday. Now, since I'm
> doing a ddrescue, all the sectors will be read (or attempted to be
> read). So the pending sectors counter may increase further.

It's a good and valid tactic to just use ddrescue with the previously
mentioned modifications for SCT ERC and kernel timeouts, rather than
directly use fsck on a drive that's clearly dying.
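
A minimal sketch of what that could look like with GNU ddrescue. This
is the usual two-pass pattern from the ddrescue manual, not something
specific to this thread, and the paths are hypothetical. The image and
mapfile have to go on a different, healthy disk. The script only prints
the commands.

```shell
#!/bin/sh
# Sketch only: prints the ddrescue commands rather than running them.
# SRC, IMG and MAP are hypothetical; adjust them to your setup.
SRC="${SRC:-/dev/sda}"
IMG="${IMG:-/mnt/rescue/sda.img}"
MAP="${MAP:-/mnt/rescue/sda.map}"

rescue_plan() {
    # Pass 1: copy everything that reads cleanly, skip the scraping phase.
    echo "ddrescue -n $SRC $IMG $MAP"
    # Pass 2: direct disc access, retry the remaining bad areas 3 times.
    echo "ddrescue -d -r3 $SRC $IMG $MAP"
}

rescue_plan
```

The mapfile is what lets the second pass pick up where the first one
left off, so keep it around between runs. fsck can then be run against
the image (or a copy of it) instead of the dying drive.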

> As I understand, when a drive cannot READ a sector, the sector is
> reported as pending. And it will stay like that until either the
> sector is finally read or until it is overwritten. When either of
> these happens, the Pending Sector Counter should decrease.

Sounds about right.

> In theory, it can go back to 0 (although I didn't follow this closely
> enough, so I never saw a drive like that).

It can and should go to zero once all the pending sectors are
overwritten with either good data or zeros. It's possible the write
succeeds to the same sector, in which case it's no longer pending and
not remapped. It's possible internally the write fails and the drive
firmware does a remap to make the write succeed, in which case it's no
longer pending.

If a write fails (externally reported write failure to the kernel),
then pending sectors will get pinned at that point and only ever go up
as the drive continues to get worse.
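
If you want to watch whether the counters actually move after the
rescue and any overwrites, something like the following should do. It's
a small sketch that assumes the usual `smartctl -A` table layout, where
the raw value is the tenth column; smart_raw is a made-up helper name.
On most ATA drives attribute 197 is Current_Pending_Sector and 5 is
Reallocated_Sector_Ct.

```shell
#!/bin/sh
# Reads `smartctl -A` output on stdin and prints the raw value of one
# attribute. Assumes the standard table layout (RAW_VALUE in column 10).
# $1 = attribute ID (197 = Current_Pending_Sector,
#                    5   = Reallocated_Sector_Ct)
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (needs root; /dev/sda is an assumption):
#   smartctl -A /dev/sda | smart_raw 197
#   smartctl -A /dev/sda | smart_raw 5
```

If 197 drops while 5 stays put, the writes landed on the original
sectors. If 5 goes up by about the same amount, the firmware remapped
them.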

> If a drive can't WRITE to a sector, it tries to reallocate it. If it
> succeeds, Reallocated Sectors Counter is increased. If it fails to
> reallocate - I guess there should be another kind of error or a
> counter, but I'm not sure which one.

You get essentially the same UNC type of error you've seen except it's
a write error instead of read. That's widely considered fatal because
having a drive that can't write is just not usable for anything (well,
read only).

>
> When reallocated sectors appear - it's clearly a bad sign. If the
> number of reallocated sectors grow - the drive should not be used.
> But it's not that obvious for the pending sectors...

They're both bad news. It's just a matter of degree. Yes, a
manufacturer probably takes the position that pending sectors and
even remapping are normal drive behavior. But realistically it's not
something anyone wants to have to deal with. It's useful for
curiosity. Use it for Btrfs testing :-D

-- 
Chris Murphy