v4.15 intermittent errors on suspend/resume

[Re-sent as text/plain, sorry]

To anyone waiting for the other shoe to drop on the SATA LPM work...

I've found something that's at least in the same area.  It triggered a fsck on my system 2 days ago.  Evidence suggests it's occurred on many other machines.  I felt that was reason enough to give you a heads up :).

I checked and I don't seem to have LPM enabled during runtime, even when running on battery.  My errors are all on suspend/resume, so maybe that behaviour was changed at the same time?

It doesn't always show in kernel logs.  What I first noticed was a mysterious SIGBUS that kills Xwayland (and hence the entire Gnome session) on resume from suspend.  It surprised me to learn that this SIGBUS can happen without leaving anything like the read errors I'm used to seeing in the kernel log!

My coredumps show the SIGBUS fault address is an instruction read inside the program code of Xwayland.  The backtraces vary along the same call chain - the common factor is that they're always at the first instruction of the function.  I assume it varies according to which page is not currently in-core, and hence triggers the failing read request.

There are *hundreds* of backtraces along this same call chain from other users, reported automatically to Fedora, that look the same.  At least so far we don't have any more plausible explanation for them.  I admit it's funny that Xwayland is so prominent, and I haven't been swamped with SIGBUS in other processes, but I stand by this analysis.

These crashes started within 24 hours of Fedora upgrading to kernel v4.15.

Fedora bug for the Xwayland SIGBUS:
    https://bugzilla.redhat.com/show_bug.cgi?id=1553979

My duplicate bug I've been spamming with puzzled comments:
    https://bugzilla.redhat.com/show_bug.cgi?id=1557682

The earliest and biggest of the many crash report buckets:

[2018-02-17] https://retrace.fedoraproject.org/faf/reports/2049964/
[315 reports] https://retrace.fedoraproject.org/faf/reports/2055378/


EXT4 filesystem error:

Mar 27 11:28:30 alan-laptop kernel: PM: suspend exit
...
Mar 27 11:28:30 alan-laptop kernel: EXT4-fs error (device dm-2): ext4_find_entry:1436: inode #5514052: comm thunderbird: reading directory lblock 0
Mar 27 11:28:30 alan-laptop kernel: Buffer I/O error on dev dm-2, logical block 0, lost sync page write
(this marked the FS

More frequently, it logs these swap errors:

Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
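
For anyone who wants to count these in their own logs, here is a minimal sketch that pulls the three numbers out of each "Read-error on swap-device" line. The function name and the field labels (major, minor, offset) are my own, and reading the third number as an offset into the device is an assumption; the sample text is just the two lines quoted above.

```python
import re

# Matches the kernel message quoted above. The three groups are labelled
# major, minor and offset here; those labels are assumptions, not kernel terms.
SWAP_ERR = re.compile(r"Read-error on swap-device \((\d+):(\d+):(\d+)\)")


def swap_read_errors(log_text):
    """Return a (major, minor, offset) tuple for every swap read error line."""
    return [tuple(int(g) for g in m.groups()) for m in SWAP_ERR.finditer(log_text)]


sample = """\
Mar 02 18:47:03 alan-laptop kernel: Restarting tasks ...
Mar 02 18:47:03 alan-laptop kernel: Read-error on swap-device (253:1:836184)
Mar 02 18:47:06 alan-laptop kernel: Read-error on swap-device (253:1:580280)
"""

for major, minor, offset in swap_read_errors(sample):
    print(f"device {major}:{minor} offset {offset}")
```

Feeding it saved `journalctl -k` output instead of the sample string should work the same way, as long as the message format matches.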


My laptop LPM status, even after removing AC power:

$ head /sys/class/scsi_host/host*/link_power_management_policy
==> /sys/class/scsi_host/host0/link_power_management_policy <==
max_performance

==> /sys/class/scsi_host/host1/link_power_management_policy <==
max_performance
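
The same check can be scripted. This is a rough sketch that reads the sysfs files shown above and returns the policy per host; the function name and the `base` parameter are hypothetical, added only so the lookup can be pointed at a different directory.

```python
import glob
import os


def lpm_policies(base="/sys/class/scsi_host"):
    """Map each hostN directory name to its link_power_management_policy value."""
    policies = {}
    pattern = os.path.join(base, "host*", "link_power_management_policy")
    for path in sorted(glob.glob(pattern)):
        host = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                policies[host] = f.read().strip()
        except OSError:
            # Skip hosts whose attribute cannot be read.
            continue
    return policies


if __name__ == "__main__":
    for host, policy in lpm_policies().items():
        print(f"{host}: {policy}")
```

On a machine with no such sysfs entries it simply prints nothing.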


My laptop is a Dell Latitude E5450.  CPU is i5-5300U (a Broadwell).

My full kernel and SIGBUS logs for this year (thanks, journalctl):

    https://www.dropbox.com/s/vwo9aj7rq389zki/2018-kernel-and-sigbus.log.gz
