On Wed, 13 Jan 2021 at 05:41, Sreyan Chakravarty <sreyan32@xxxxxxxxx> wrote:
On Tue, Jan 12, 2021 at 9:16 AM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
>
> -x has more information that might be relevant including firmware
> revision and some additional logs for recent drive reported errors
> which usually are benign. But might be clues.
>
> These two attributes I'm not familiar with
> 187 Reported_Uncorrect 0x0032 100 096 000 Old_age Always - 4294967301
> 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 98785820672
>
> But the value is well above threshold for both so I'm not worried about it.
>
>
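The two large raw values quoted above may be easier to read when split into 16-bit words, since some drive vendors are reported to pack several counters into one 48-bit raw field. A minimal Python sketch, assuming that packing (it is not confirmed for this drive, and the helper name split_raw48 is made up for illustration):

```python
def split_raw48(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high word first.

    Assumption: the raw field packs three 16-bit counters. Whether that is
    true for a given attribute depends on the drive vendor.
    """
    raw &= (1 << 48) - 1  # keep only the low 48 bits
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


# Raw values copied from the quoted smartctl attribute lines.
for name, raw in (("Reported_Uncorrect", 4294967301),
                  ("Command_Timeout", 98785820672)):
    print(name, hex(raw), split_raw48(raw))
```

Under that assumption the values split into small numbers (1, 0, 5 and 23, 24, 0) rather than billions of errors. What each word counts is vendor-specific, so treat this only as a hint for further reading.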
Here is the output of:
# smartctl -Ax /dev/sda
https://pastebin.com/raw/GrgrQrSf
I have no idea what it means.
You are not alone. Most people stop reading at the line:
SMART overall-health self-assessment test result: PASSED

Before retiring I worked in remote sensing, which is a data-intensive
activity. HDD failures were a major issue. One sure way to kill a
drive was to start a batch job that filled a disk and then kept hammering
the drive over a long weekend when I was off somewhere without network
access. I could usually get warranty replacements for failed drives by
submitting the smartctl reports. We used XFS, starting on SGI IRIX and
then on Linux when it became available, with striped arrays for
throughput with I/O-bound processes. XFS was designed to avoid lengthy
filesystem repair times, so getting a system back after a drive failure
just meant waiting for the tape robot to find and restore the backup tapes.
HDDs are mechanical, so they are subject to wear. With heavy use they tend to die
shortly after end of warranty. I started replacing drives at end of warranty,
which, along with measures to reduce runaway batch jobs, greatly reduced
the number of failures. Your drive has been used for 1671 hours, and
1491 power-on cycles. Mechanical device wear is often highest at startup,
so this is probably getting close to the design lifetime of a consumer laptop
HDD.
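For a sense of scale, the two figures above work out to roughly one hour of running per power cycle. A quick sketch of the arithmetic in Python (the variable names are mine, the numbers are the ones quoted above):

```python
# Figures quoted above from the smartctl report.
power_on_hours = 1671
power_cycles = 1491

# Average running time between power-ups.
hours_per_cycle = power_on_hours / power_cycles
print(f"{hours_per_cycle:.2f} hours per power cycle")
```

A drive that is started about once per hour of use sees many more start-ups per hour than one that runs continuously, which is why the cycle count matters here as much as the hours.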
There are workloads (image processing, numerical modelling) where recovering
the work done since the last backup just means restarting a batch job and is
generally easier than trying to repair a filesystem with a bunch of partially written
HDF5 files.
Given the age of your HDD, I would replace it. If your laptop came with Windows,
you should be able to install Windows 10 on a small partition in order to upgrade the
BIOS and maybe run the drive vendor's diagnostics. You may want to revisit your
choices of drive technology, filesystem, backup and recovery strategy, etc. with
your use case in mind.
This is the problem with SMART tests: they are so esoteric that it is
difficult for a common user to make sense of them.
Let me know what you think, if you see any glaring faults.
You are to be commended for helping the btrfs developers investigate one of the
rare situations that make filesystems such a hard problem. My experience indicates
your HDD is involved, either by old age or some BIOS or drive firmware glitch, so
your best way forward is to make sure your BIOS is current and replace the drive
with one suited to your use case.
George N. White III