Re: BTRFS partition corrupted after deleting files in /home

"George N. White III" <gnwiii@xxxxxxxxx> · Wed, 13 Jan 2021 09:01:52 -0400

On Wed, 13 Jan 2021 at 05:41, Sreyan Chakravarty <sreyan32@xxxxxxxxx> wrote:
On Tue, Jan 12, 2021 at 9:16 AM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:

> -x has more information that might be relevant including firmware
> revision and some additional logs for recent drive reported errors
> which usually are benign. But might be clues.
>
> These two attributes I'm not familiar with
> 187 Reported_Uncorrect      0x0032   100   096   000    Old_age   Always       -       4294967301
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       98785820672
>
> But the value is well above threshold for both so I'm not worried about it.
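
For anyone who wants to check those two rows mechanically, here is a minimal Python sketch. It assumes the usual `smartctl -A` column order (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw) and rejoins the two rows quoted above onto one line each. The helper names `parse_row` and `split_raw` are made up for illustration. Splitting the raw value into 16-bit words is only a guess at how such large numbers might be packed; what each word means is vendor-specific and not established anywhere in this thread.

```python
# The two attribute rows quoted above, one row per line.
ROWS = """\
187 Reported_Uncorrect 0x0032 100 096 000 Old_age Always - 4294967301
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 98785820672
"""


def parse_row(line):
    """Return the numeric fields of one attribute row as a dict."""
    f = line.split()
    return {
        "id": int(f[0]),
        "name": f[1],
        "value": int(f[3]),   # normalized current value
        "worst": int(f[4]),   # lowest normalized value seen so far
        "thresh": int(f[5]),  # failure threshold
        "raw": int(f[9]),     # raw value (assumed here to be up to 48 bits)
    }


def split_raw(raw):
    """Split a raw value into three 16-bit words, high word first.

    Assumption: the oversized raw numbers are several small counters
    packed together. This only does the arithmetic; it does not say
    what each word means.
    """
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


for line in ROWS.splitlines():
    a = parse_row(line)
    status = "above threshold" if a["worst"] > a["thresh"] else "at or below threshold"
    print(a["id"], a["name"], status, "raw words:", split_raw(a["raw"]))
```

Both rows come out above threshold, and the raw values split into [1, 0, 5] and [23, 24, 0], which at least look like small counts rather than billions of errors.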

Here is the output of:

# smartctl -Ax /dev/sda

https://pastebin.com/raw/GrgrQrSf

I have no idea what it means.

You are not alone. Most people stop reading at the line:
SMART overall-health self-assessment test result: PASSED

Before retiring I worked in remote sensing, which is a data-intensive
activity. HDD failures were a major issue. One sure way to kill a
drive was to start a batch job that filled a disk and then kept hammering
the drive over a long weekend when I was off somewhere without network
access. I could usually get warranty replacements for failed drives by
submitting the smartctl reports. We used XFS, starting on SGI IRIX and
then on Linux when it became available, with striped arrays for
throughput with I/O-bound processes. XFS was designed to avoid lengthy
filesystem repair times, so getting a system back after a drive failure
just meant waiting for the tape robot to find and restore the backup tapes.

HDDs are mechanical, so they are subject to wear. With heavy use they tend to die
shortly after end of warranty. I started replacing drives at end of warranty,
which, along with measures to reduce runaway batch jobs, greatly reduced
the number of failures. Your drive has been used for 1671 hours and
1491 power-on cycles. Mechanical wear is often highest at startup,
so this is probably getting close to the design lifetime of a consumer laptop
HDD.
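
As a rough back-of-envelope check on those two figures, here is a small Python sketch. The function name is made up for illustration and the numbers are simply the ones quoted above; it says nothing about any vendor's actual rated start/stop count.

```python
def hours_per_power_cycle(power_on_hours, power_cycles):
    """Average running time per power-on, in hours."""
    return power_on_hours / power_cycles


avg = hours_per_power_cycle(1671, 1491)
print(f"{avg:.2f} hours per power cycle")
```

That works out to roughly 1.12 hours per power cycle, so the drive has seen many short sessions rather than a few long runs.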

There are workloads (image processing, numerical modelling) where recovering
the work done since the last backup just means restarting a batch job and is 
generally easier than trying to repair a filesystem with a bunch of partially written 
HDF5 files.   

Given the age of your HDD, I would replace it.   If your laptop came with Windows,
you should be able to install Windows 10 on a small partition in order to upgrade the
BIOS and maybe run the drive vendor's diagnostics.   You may want to revisit your
choices of drive technology, filesystem, backup and recovery strategy, etc. with
your use case in mind.   

This is the problem with SMART tests: they are so esoteric that it is
difficult for a common user to make sense of them.

Let me know what you think, if you see any glaring faults.

You are to be commended for helping the btrfs developers investigate one of the 
rare situations that make filesystems such a hard problem.   My experience indicates
your HDD is involved, either by old age or some BIOS or drive firmware glitch, so
your best way forward is to make sure your BIOS is current and replace the drive
with one suited to your use case.

-- 
George N. White III
