On Thu, Jun 30, 2016 at 1:40 PM, Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
> Ok, I think I was able to catch the error without having to get that
> desperate :)
>
> I logged in from work and did journalctl -f and got the following:
>
> http://pastebin.com/3JAL297z
>
> Looks disk hardware related but it's not getting worse so I doubt it's the
> disk itself, driver problem instead?

Jun 30 14:11:38 ladyhobbes kernel: ata1.00: exception Emask 0x0 SAct 0x800 SErr 0x50000 action 0x6 frozen
Jun 30 14:12:38 ladyhobbes kernel: ata1: SError: { PHYRdyChg CommWake }
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: cmd 61/20:58:00:80:3b/01:00:0e:00:00/40 tag 11 ncq 147456 out res 40/00:37:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Jun 30 14:12:38 ladyhobbes kernel: ata1.00: status: { DRDY }
Jun 30 14:12:38 ladyhobbes kernel: ata1: hard resetting link

It's the drive itself. The drive fails to respond to a write command and hangs until the kernel's SCSI command timer expires, at which point the kernel hard resets the link. This happens several times; ext4 needs to write its journal superblock, can't, gets pissed, gives up, and goes read-only.

So the short version is the drive needs to be replaced. Write failures are always disqualifying. If this were an md software raid device, md would immediately mark the drive faulty on a single one of these kinds of errors.

Due to arguably flawed engineering, the drive is apparently taking more than 30 seconds to figure out it can't write to this sector, which is pretty messed up. On top of that, the Linux kernel's default command timer is arguably too short, so the kernel gives up before we get a discrete, proper error message from the drive as to what's going on. This is now a common misconfiguration that's actually quite bad, but itself an edge case since read errors are so rare and write errors are even more rare.
Anyway, if you really want to play with this more, you can do:

smartctl -l scterc /dev/sdX
## This reveals the drive's SCT ERC support and setting, which must always be shorter than the kernel's command timer. I expect it to be disabled, which means the value is unknown but could be as high as 180 seconds.

cat /sys/block/sdX/device/timeout
## This reveals the kernel command timer, which defaults to 30, so I expect it to be 30.

Proper configuration means the drive gives up on errors before the kernel does, i.e. the first value needs to be less than the second. Pretty much no one has this unless they're using enterprise or NAS drives. So diplomatically this situation I'd call totally fucked. Undiplomatically, well, that involves drinking games first.

For the latest in this saga, I posted this a few days ago on linux-raid@, which is upstream for all things Linux RAID but in particular md. md ends up being the hardest hit by this problem, as it eventually results in things like total raid5 (even raid6) collapse when the array should be able to survive.

http://marc.info/?l=linux-raid&m=146704573129021&w=2

And yes, I understand this thread involves one drive, but the misconfiguration is a problem there too, because manufacturers expect consumer drives to have these "deep" recoveries for marginally bad sectors. Those can take (seriously) upwards of 3 minutes to sort out, during which time the drive is unresponsive. Right now Linux will have none of that, and just resets the drive. That's solvable for read errors by increasing the kernel command timer. It's not solvable, probably, for write errors. I think if you increase the kernel command timer to 180 by using 'echo 180 > /sys...', what'll happen is you'll just get a discrete write error from the drive eventually.

So yeah, replace the drive.
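If it helps, the comparison between the two values above can be sketched as a small shell function. This is only a rough sketch, not a vetted tool: check_timeouts is a made-up name, and you pass in the two numbers by hand instead of having it parse smartctl output. One thing to watch: smartctl prints the ERC value in tenths of a second (e.g. "70 (7.0 seconds)"), while the sysfs timeout is in whole seconds, so the function converts before comparing.

```shell
#!/bin/sh
# Hypothetical helper (name and messages are illustrative, not from any tool).
# $1: SCT ERC value in deciseconds as shown by `smartctl -l scterc`
#     (e.g. 70 means 7.0 seconds), or the word "disabled".
# $2: kernel command timer in seconds, i.e. the contents of
#     /sys/block/sdX/device/timeout.
check_timeouts() {
    erc_ds=$1
    timeout_s=$2

    if [ "$erc_ds" = "disabled" ]; then
        # With ERC disabled the drive's recovery time is not bounded by ERC.
        echo "unknown: ERC disabled, drive recovery time is unbounded"
        return 2
    fi

    # Compare in deciseconds so no fractional arithmetic is needed.
    if [ "$erc_ds" -lt $((timeout_s * 10)) ]; then
        echo "ok: drive gives up before the kernel does"
        return 0
    fi

    echo "bad: kernel gives up before the drive does"
    return 1
}

# Example with real values (sdX is a placeholder for your device):
#   check_timeouts 70 "$(cat /sys/block/sdX/device/timeout)"
```

With the values I'd expect on a typical consumer drive (ERC disabled, timer at 30), this should land in the "unknown" branch, which is the misconfiguration described above.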
--
Chris Murphy