Re: Intel Core i5-6200U laptop issue

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Fri, 1 Jul 2016 10:09:55 -0600

On Thu, Jun 30, 2016 at 1:40 PM, Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
> Ok, I think I was able to catch the error without having to get that
> desperate :)
>
> I logged in from work and did journalctl -f and got the following:
>
> http://pastebin.com/3JAL297z
>
> Looks disk hardware related but it's not getting worse so I doubt it's the
> disk itself, driver problem instead?

Jun 30 14:11:38 ladyhobbes kernel: ata1.00: exception Emask 0x0 SAct 0x800 SErr 0x50000 action 0x6 frozen
Jun 30 14:12:38 ladyhobbes kernel: ata1: SError: { PHYRdyChg CommWake }
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: cmd 61/20:58:00:80:3b/01:00:0e:00:00/40 tag 11 ncq 147456 out
                                            res 40/00:37:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: status: { DRDY }
Jun 30 14:12:38 ladyhobbes kernel: ata1: hard resetting link

It's the drive itself. The drive fails to respond to a write command
and hangs until the kernel's SCSI command timer expires, at which
point the kernel hard resets the link. This happens several times;
ext4 needs to write its journal superblock, can't, gets pissed, gives
up, and goes read-only.
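
To get a feel for how often this is happening, the kernel messages can
be filtered for the two tell-tale lines in the excerpt above. This is
only a rough sketch: it assumes the log text matches that excerpt, and
count_ata_events is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Count libata command timeouts and link resets in kernel log text
# read from stdin. count_ata_events is a hypothetical helper name.
count_ata_events() {
    awk '
        /Emask 0x4 \(timeout\)/ { timeouts++ }
        /hard resetting link/   { resets++ }
        END { printf "timeouts=%d resets=%d\n", timeouts, resets }
    '
}

# Usage (kernel messages from the current boot):
#   journalctl -k -b | count_ata_events
```

A count that keeps climbing between checks would fit the "happens
several times" picture described above.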

So the short version is the drive needs to be replaced. Write failures
are always disqualifying. If this were an md software raid device, md
would immediately mark the drive faulty on a single one of these kinds
of errors.

Due to arguably flawed engineering, the drive is apparently taking
more than 30 seconds to figure out it can't write to this sector,
which is pretty messed up. And the Linux kernel's default command
timer is arguably too short, so it gives up before we get a proper
discrete error message from the drive as to what's going on. This is
now a common misconfiguration that's actually quite bad, though it
only bites in an edge case, since read errors are rare and write
errors are even rarer.

Anyway if you really want to play with this more, you can do:

smartctl -l scterc /dev/sdX  ## this will reveal the drive's SCT ERC
support and setting, which must always be shorter than the kernel's
command timer. I expect this to be disabled, which means the value is
unknown but could be as high as 180 seconds.

cat /sys/block/sdX/device/timeout  ## this will reveal the kernel
command timer, which is 30 seconds by default, so I expect it to be 30.

Proper configuration means the drive gives up on errors before the
kernel does, i.e. the first value needs to be less than the second.
Pretty much no one has this unless they're using enterprise or NAS
drives. So diplomatically, I'd call this situation totally fucked.
Undiplomatically, well, that involves drinking games first.
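
The "first value less than the second" rule can be sketched as a small
check. Treat this as an illustration rather than a tool:
compare_timeouts is a hypothetical helper, and it assumes you read the
two numbers by hand from the commands above. Note that smartctl reports
SCT ERC in tenths of a second (70 means 7.0 seconds), while the kernel
command timer is in whole seconds.

```shell
#!/bin/sh
# compare_timeouts is a hypothetical helper.
#   $1: drive SCT ERC value in deciseconds, as smartctl prints it
#       (e.g. 70 for 7.0 seconds); pass 0 if smartctl says Disabled
#   $2: kernel command timer in seconds
compare_timeouts() {
    erc_ds=$1
    kernel_s=$2
    if [ "$erc_ds" -eq 0 ]; then
        echo "unknown: SCT ERC disabled, drive recovery time is not known"
    elif [ "$erc_ds" -lt $((kernel_s * 10)) ]; then
        echo "ok: drive gives up before the kernel does"
    else
        echo "bad: kernel resets the link before the drive reports an error"
    fi
}

# Usage, e.g. a drive set to 7.0 seconds against the 30 second default:
#   compare_timeouts 70 30
```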

For the latest in this saga, I posted this a few days ago on
linux-raid@ which is upstream for all things Linux RAID but in
particular md, which ends up being the hardest hit by this problem as
it eventually results in things like total raid5 (even raid6) collapse
when it should be able to survive.
http://marc.info/?l=linux-raid&m=146704573129021&w=2

And yes, I understand this thread involves one drive, but the
misconfiguration is a problem there too because manufacturers expect
consumer drives to have these "deep" recoveries for marginally bad
sectors, that can take (seriously) upwards of 3 minutes to sort out,
during which time the drive is unresponsive. And right now Linux will
have none of that, and just resets the drive. That's solvable for read
errors by increasing the kernel command timer. It's not solvable,
probably, for write errors. I think if you increase the kernel command
timer to 180 by using 'echo 180 > /sys...' what'll happen is you'll
just get a discrete write error from the drive eventually.
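
For anyone who wants to try the longer timer, here is a hedged sketch.
It assumes the file to write is the same
/sys/block/sdX/device/timeout that the cat command above reads;
set_command_timer is a made-up helper name, and its third argument
exists only so it can be tried against a scratch directory instead of
the real sysfs.

```shell
#!/bin/sh
# set_command_timer is a hypothetical helper.
#   $1: block device name (sdX is a placeholder)
#   $2: new command timer in seconds
#   $3: optional sysfs root, defaults to /sys/block
set_command_timer() {
    dev=$1
    secs=$2
    root=${3:-/sys/block}
    file="$root/$dev/device/timeout"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (not root, or wrong device name?)" >&2
        return 1
    fi
    echo "$secs" > "$file"
    echo "$dev: command timer now $(cat "$file")s"
}

# Usage (as root; the value does not survive a reboot):
#   set_command_timer sdX 180
```

This only changes how long the kernel waits. It does nothing for a
drive that can't complete the write.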

So yeah, replace the drive.

-- 
Chris Murphy