On 2021-07-28 1:12 a.m., Chris Murphy wrote:
On Tue, Jul 27, 2021 at 3:36 PM old sixpack13 <sixpack13@xxxxxxxxx> wrote:
...
is your GPU from intel ?
if so:
- I get it too, sometimes while browsing with FF.
- Ctrl+Alt+F3 to get a console (?) and do dmesg => ...GPU Crash dump ... GPU hang...
+++ EDIT +++
I should have read the first thread again: it's an Intel GPU.
anyway, after Ctrl+Alt+F3 you should be able to do
"sync && sync && sudo systemctl reboot"
saves the headache about a possible (?) btrfs filesystem corruption when doing a "hardcore power off"
IIRC, a btrfs scrub ... afterwards could help
There shouldn't be such a thing as file system corruption following
forced power off. It's sufficiently well tested on ext4, xfs, and
btrfs that if there's corruption, it's almost certainly a drive
firmware bug getting write order wrong by not honoring flush/FUA when
it should.
Btrfs has a bit of an advantage in these cases because it's got a
pretty simple update order:
data + metadata -> flush/FUA -> superblock -> flush/FUA
So in theory, the superblock only points to trees that are definitely
valid. All changes, data and metadata get written into free space
(copy-on-write, no overwrites), and therefore the worst case is data
being written is simply lost during a crash because a superblock
update didn't happen before the crash. A superblock that points to
bad/stale/missing trees means a new superblock made it to disk before
the metadata did, and the metadata was lost. That's a firmware bug. We
know that because there's an astronomical amount of testing done on all
the file systems, including btrfs, using xfstests. And a number of those
tests use dm-log-writes, which expressly tests for proper write ordering
by the file system.
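As a toy analogy for the update order above (this is not how btrfs is
implemented, just the same idea with plain files): new content goes into
a new file, gets flushed, and only then is a small pointer file switched
over to it. The names cow_commit, current_tree, tree.N and super are made
up for illustration, and sync is only a stand-in for flush/FUA.

```shell
# Toy model of "data + metadata -> flush -> superblock -> flush".
# Usage: some_command | cow_commit DIR GENERATION
cow_commit() {
    dir=$1
    gen=$2
    # 1. Write the new tree into "free space": a new file, never over the old one.
    cat > "$dir/tree.$gen"
    sync    # stand-in for flush/FUA
    # 2. Only now repoint the "superblock". rename() within one directory is atomic.
    printf '%s\n' "$gen" > "$dir/super.tmp"
    mv "$dir/super.tmp" "$dir/super"
    sync    # stand-in for the second flush/FUA
}

# Read whatever tree the "superblock" currently points to.
current_tree() {
    cat "$1/tree.$(cat "$1/super")"
}
```

If the machine dies between step 1 and step 2, the pointer still names the
previous, complete tree, and only the newest write is lost. The older
tree.N files are left in place, which is roughly what makes falling back
to an older root possible at all.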
Even in case of such a firmware bug, Btrfs can sometimes recover by
mounting with:
mount -o usebackuproot
mount -o rescue=usebackuproot
(same thing)
This picks an older root to mount instead of the one the super says
should be the most recent. But this still implies the drive firmware
did something wrong.
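A minimal sketch of how that fallback might be scripted, assuming you want
to try a normal read-only mount first. The function name try_mount_btrfs
and the DRY_RUN variable are made up here; the device and mount point are
whatever applies on your system. Mounting read-only is my own cautious
choice, not something the option requires as far as I know.

```shell
# Try a plain read-only mount; if that fails, retry with an older tree root.
# Set DRY_RUN=1 to print the commands instead of running them.
try_mount_btrfs() {
    dev=$1
    mnt=$2
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "mount -o ro $dev $mnt"
        echo "mount -o ro,rescue=usebackuproot $dev $mnt"
        return 0
    fi
    mount -o ro "$dev" "$mnt" && return 0
    # Older kernels may only accept the plain "usebackuproot" spelling.
    mount -o ro,rescue=usebackuproot "$dev" "$mnt"
}
```

This needs root when run for real. If the second mount works, copy off
anything important before doing anything else with that drive.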
btrfs scrub checks integrity: it compares the information in data
and metadata blocks with the checksum for each block. This can only be
done with the file system mounted.
btrfs check checks the consistency of the file system. It's a
metadata-only check, but it doesn't just verify that the checksum
matches, it verifies that the metadata itself is correct. The file
system needs to be unmounted.
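To make the mounted/unmounted distinction concrete, here is a small sketch
of the two invocations. The wrapper names scrub_mounted and
check_unmounted, and the RUN variable (set RUN=echo to only print the
command), are my own additions for illustration.

```shell
# Online integrity check: takes the mount point of a mounted file system.
scrub_mounted() {
    # -B keeps the scrub in the foreground and prints statistics at the end.
    ${RUN:-} btrfs scrub start -B "$1"
}

# Offline consistency check: takes the block device of an unmounted file system.
check_unmounted() {
    # --readonly only reports problems and makes no changes to the device.
    ${RUN:-} btrfs check --readonly "$1"
}
```

For example, "scrub_mounted /home" on a running system, and
"check_unmounted /dev/sdX2" from a live USB where the device (a
placeholder here) is not mounted. Both need root.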
There's also the write time and read time tree checkers. Not
everything is included in these checks but they do catch certain kinds
of corruption at either read time (it's already happened and on disk
so let's stop here and not make it worse), or write time (it's not yet
on disk, let's stop here). Common causes of write time tree check
errors are memory bit flips, but also sometimes kernel bugs and even
btrfs bugs. I guess you could call it a nascent online fsck, but
without repair capability. Currently it flips the file system
read-only to stop further confusion and keep data safe.
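If you want to see whether the tree checker has fired, its complaints land
in the kernel log. A rough filter, assuming the message wording I remember
from recent kernels ("corrupt leaf", "corrupt node", "read/write time tree
block corruption detected"); the exact text can differ between kernel
versions, and the function name tree_checker_hits is made up.

```shell
# Filter kernel log lines on stdin for likely btrfs tree-checker messages.
tree_checker_hits() {
    grep -E 'BTRFS.*(corrupt (leaf|node)|(read|write) time tree block corruption)'
}
```

Typical use would be "dmesg | tree_checker_hits" for the current boot, or
"journalctl -k -b -1 | tree_checker_hits" for the previous one. No output
means nothing matched, which is not proof that nothing is wrong.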
Just an update:
The type of failure that I consistently kept seeing in the logs was
illegal addressing in high memory during shutdown. The failing address
always looked like a couple of stuck high address bits, which was
unlikely with a well-tested processor, motherboard and memory.
Sometimes that was not fatal, while other times it caused a hard crash or
a freeze. Repeated memory and motherboard testing has shown no fault,
and I've replaced everything else multiple times. This kind of fault is
really tough to diagnose. Upon a lot of reflection in the middle of more
than a few nights, I upgraded the BIOS and ACPI firmware on this Lenovo
P300 from the 2016 version to the latest June/2020 version. Since then
I have not experienced a failure in 3 weeks of daily use with daily
shutdowns.
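For anyone wanting to confirm which firmware revision they are actually
running before and after such an update, the kernel exposes the DMI
strings under /sys/class/dmi/id. A small sketch; the function name
bios_info is made up, and the optional directory argument exists only so
it can be pointed somewhere else for testing.

```shell
# Print the BIOS vendor, version and date as reported by the kernel via DMI.
bios_info() {
    dir=${1:-/sys/class/dmi/id}
    for f in bios_vendor bios_version bios_date; do
        # Skip entries that are missing or unreadable on this machine.
        [ -r "$dir/$f" ] && printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    done
    return 0
}
```

Kernel messages from the previous boot, including a shutdown that went
wrong, can usually be read afterwards with "journalctl -k -b -1", assuming
the journal is kept persistently.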
As an aside, Lenovo does not make it easy to upgrade unless you are on
Windows -- something for the Lenovo folks to rectify in their firmware
distribution site. Thankfully the stand-alone DVD image did the trick.
This outcome is a huge relief, but would seem to show that something in
the ACPI code is not checking for valid addresses. The newer ACPI
firmware is no longer providing bogus addresses, and the problem has
gone away -- something for the kernel people to think about and identify
the faulty/missing checks.
I'm still waiting and watching, but the machine is now highly stable
when running Fedora.
--
John Mellor