Re: Extreme startup delay on F34

John Mellor <john.mellor@xxxxxxxxx> · Sun, 15 Aug 2021 20:30:57 -0400

On 2021-07-28 1:12 a.m., Chris Murphy wrote:
On Tue, Jul 27, 2021 at 3:36 PM old sixpack13 <sixpack13@xxxxxxxxx> wrote:

...
is your GPU from intel ?
if so:
- I get it too, sometimes while browsing with FF.
- Ctrl+Alt+F3 to get a console (?) and do dmesg => ...GPU Crash dump ... GPU hang...

+++ EDIT +++
I should have read the first thread again: it's an Intel GPU.

anyway, after Ctrl+Alt+F3 you should be able to do
"sync && sync && sudo systemctl reboot"

saves the headache about a possible (?) btrfs filesystem corruption when doing a "hardcore power off"
IIRC, a btrfs scrub ... afterwards could help

There shouldn't be such a thing as file system corruption following
forced power off. It's sufficiently well tested on ext4, xfs, and
btrfs that if there's corruption, it's almost certainly a drive
firmware bug getting write order wrong by not honoring flush/FUA when
it should.

Btrfs has a bit of an advantage in these cases because it's got a
pretty simple update order:

data + metadata -> flush/FUA -> superblock -> flush/FUA

So in theory, the superblock only points to trees that are definitely
valid. All changes, data and metadata, get written into free space
(copy-on-write, no overwrites), and therefore the worst case is that
data being written is simply lost during a crash because a superblock
update didn't happen before the crash. A superblock that points to
bad/stale/missing trees means a new superblock made it to disk before
the metadata did, and the metadata was lost. That's a firmware bug. We
know that because an astronomical number of tests are run on all the
file systems, including btrfs, using xfstests. And a number of those
tests use dm-log-writes, which expressly tests for proper write
ordering by the file system.
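
The ordering above can be illustrated with a toy model. This is only a sketch, not how Btrfs is implemented: plain files stand in for on-disk trees and the superblock, barrier is a no-op placeholder for flush/FUA, and the names commit_generation and current_tree are made up for illustration.

```shell
#!/bin/sh
# Toy model: files stand in for trees and the superblock.

# Placeholder for flush/FUA; a real barrier would force data to stable media.
barrier() { :; }

# Write a new tree into "free space", flush, then repoint the superblock.
# $1 = directory, $2 = generation number, $3 = payload,
# $4 = "crash" to simulate power loss before the superblock update.
commit_generation() {
    dir=$1 gen=$2 payload=$3 mode=${4:-ok}
    printf '%s\n' "$payload" > "$dir/tree.$gen"   # new file, nothing overwritten
    barrier                                       # data + metadata -> flush/FUA
    [ "$mode" = crash ] && return 0               # simulated crash: superblock untouched
    printf '%s\n' "$gen" > "$dir/super.tmp"
    mv "$dir/super.tmp" "$dir/super"              # superblock now points at the new tree
    barrier                                       # superblock -> flush/FUA
}

# Print the contents of the tree the superblock currently points to.
current_tree() {
    cat "$1/tree.$(cat "$1/super")"
}
```

If a commit is interrupted with the "crash" argument, current_tree still returns the previous generation's payload: the new data is lost, but the superblock never points at anything incomplete.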

Even in case of such a firmware bug, Btrfs can sometimes recover by
mounting with:

mount -o usebackuproot
mount -o rescue=usebackuproot

(same thing)

This picks an older root to mount instead of the one the super says
should be the most recent. But this still implies the drive firmware
did something wrong.
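
A cautious way to try this might look like the sketch below. The helper name backup_root_mount_cmd, the device /dev/sdX2 and the mount point /mnt are placeholders, and adding "ro" is my own assumption (so nothing gets written while you look around), not part of the advice above. The helper only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper: prints the mount command instead of running it.
# $1 = block device (placeholder), $2 = mount point (placeholder).
backup_root_mount_cmd() {
    # "ro" is an added precaution; rescue=usebackuproot is the option named above.
    printf 'mount -o ro,rescue=usebackuproot %s %s\n' "$1" "$2"
}

backup_root_mount_cmd /dev/sdX2 /mnt
```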

btrfs scrub checks integrity: it compares the contents of each data
and metadata block with the checksum for that block. This can only be
done with the file system mounted.

btrfs check checks the consistency of the file system. It's a
metadata-only check, but it doesn't just verify that the checksum
matches; it verifies that the metadata itself is correct. The file
system needs to be unmounted.
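
As a rough sketch of which tool goes with which state: the helper name integrity_cmd and the targets are placeholders, and it only prints the command rather than running it. The -B flag keeps scrub in the foreground until it finishes, and --readonly makes explicit that check should not attempt any repair.

```shell
#!/bin/sh
# Hypothetical helper: prints the command matching the file system's state.
# $1 = "mounted" or "unmounted"; $2 = mount point (scrub) or block device (check).
integrity_cmd() {
    case $1 in
        mounted)
            # scrub works on a mounted file system
            printf 'btrfs scrub start -B %s\n' "$2" ;;
        unmounted)
            # check wants the file system unmounted
            printf 'btrfs check --readonly %s\n' "$2" ;;
        *)
            printf 'usage: integrity_cmd mounted|unmounted TARGET\n' >&2
            return 2 ;;
    esac
}
```

For example, integrity_cmd mounted /mnt prints the scrub command, and integrity_cmd unmounted /dev/sdX2 prints the check command.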

There are also the write-time and read-time tree checkers. Not
everything is included in these checks, but they do catch certain
kinds of corruption at either read time (it's already happened and is
on disk, so let's stop here and not make it worse) or write time (it's
not yet on disk, let's stop here). Common causes of write-time tree
checker errors are memory bit flips, but also sometimes kernel bugs
and even btrfs bugs. I guess you could call it a nascent online fsck,
but without repair capability. Currently it flips the file system
read-only to stop further confusion and keep data safe.
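
To see whether the tree checkers have fired, the kernel log can be filtered for their reports. This is a sketch under assumptions: tree_checker_hits is a made-up helper name, and the message fragments in the pattern are my recollection of the kernel's wording, which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and keeps only lines
# that look like Btrfs tree-checker reports. The substrings are assumed
# wording and may need adjusting for a given kernel.
tree_checker_hits() {
    grep -E 'BTRFS.*(corrupt (leaf|node)|(read|write) time tree block corruption detected)'
}

# Typical use (reading the kernel log may need root):
#   dmesg | tree_checker_hits
```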

Just an update:

The type of failure that I consistently kept seeing in the logs was
illegal addressing in high memory during shutdown.  The failing address
always looked like a couple of stuck high address bits, which was
unlikely with a well-tested processor, motherboard and memory.
Sometimes that was not fatal, while other times it caused a hard crash
or a freeze.  Repeated memory and motherboard testing has shown no
fault, and I've replaced everything else multiple times.  This kind of
fault is really tough to diagnose. Upon a lot of reflection in the
middle of more than a few nights, I upgraded the BIOS and ACPI firmware
on this Lenovo P300 from the 2016 version to the latest June/2020
version.  Since then I have not experienced a failure in 3 weeks of
daily use with daily shutdowns.
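
For anyone chasing a similar pattern, one quick way to spot bits that are set in every failing address is to AND the addresses together. This is only a sketch: common_set_bits is a made-up helper name, the addresses in the usage line are invented for illustration (not taken from my logs), and shell arithmetic is signed 64-bit, so addresses with the top bit set would need another tool.

```shell
#!/bin/sh
# Hypothetical helper: prints the bits that are set in every address given.
# Arguments are addresses as hex constants, e.g. 0xc0001000.
common_set_bits() {
    acc=$(( $1 ))
    shift
    for addr in "$@"; do
        acc=$(( acc & $addr ))   # keep only bits present in all addresses so far
    done
    printf '0x%x\n' "$acc"
}

# Made-up example addresses:
common_set_bits 0xc0001000 0xc0002340 0xc00ff000   # prints 0xc0000000
```

A result with only a few high bits set, as in the example, is what "stuck high address bits" would look like; a result of 0x0 means the addresses share no set bits.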

As an aside, Lenovo does not make it easy to upgrade unless you are on 
Windows -- something for the Lenovo folks to rectify in their firmware 
distribution site.  Thankfully the stand-alone DVD image did the trick.

This outcome is a huge relief, but would seem to show that something in
the ACPI code is not checking for valid addresses. The newer ACPI
firmware is no longer providing bogus addresses, and the problem has
gone away -- something for the kernel people to think about, so that
the faulty or missing checks can be identified.

I'm still waiting and watching, but the machine is now highly stable 
when running Fedora.

--

John Mellor