On 2021-07-28 1:12 a.m., Chris Murphy wrote:
On Tue, Jul 27, 2021 at 3:36 PM old sixpack13 <sixpack13@xxxxxxxxx> wrote:
...
is your GPU from Intel?
if so:
- I get it too, sometimes while browsing with FF.
- Ctrl+Alt+F3 to get a console (?) and run dmesg => ...GPU Crash dump ... GPU hang...
+++ EDIT +++
I should have read the first thread again: it's an Intel GPU.
anyway, after Ctrl+Alt+F3 you should be able to do
"sync && sync && sudo systemctl reboot"
saves the headache of a possible (?) btrfs filesystem corruption from a "hardcore power off"
IIRC, a btrfs scrub ... afterwards could help
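The recovery sequence described above, sketched as a console session (the dmesg pattern is only an example of what an i915 hang message can look like):

```shell
# After Ctrl+Alt+F3 and logging in on the text console:

# 1. Confirm it was a GPU hang rather than a full kernel lockup
dmesg | grep -iE 'gpu (hang|crash)'

# 2. Flush dirty pages to disk
sync && sync

# 3. Clean reboot instead of holding the power button
sudo systemctl reboot
```

A clean reboot lets the filesystem commit in-flight transactions normally, which is the whole point versus a hard power-off.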
There shouldn't be such a thing as file system corruption following
forced power off. It's sufficiently well tested on ext4, xfs, and
btrfs that if there's corruption, it's almost certainly a drive
firmware bug getting write order wrong by not honoring flush/FUA when
it should.
Btrfs has a bit of an advantage in these cases because it's got a
pretty simple update order:
data + metadata -> flush/FUA -> superblock -> flush/FUA
So in theory, the superblock only points to trees that are definitely
valid. All changes, data and metadata, get written into free space
(copy-on-write, no overwrites), and therefore the worst case is that
data being written during a crash is simply lost because the
superblock update didn't happen before the crash. A superblock that
points to bad/stale/missing trees means a new superblock made it to
disk before the metadata did, and the metadata was lost. That's a
firmware bug. We know that because there's an astronomical amount of
testing done on all the file systems, including btrfs, using xfstests.
And a number of those tests use dm-log-writes, which expressly tests
for proper write ordering by the file system.
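For the curious, this is roughly how those tests are run. A hedged sketch, assuming an xfstests checkout and two scratch block devices (the device names are placeholders); generic/482 is one of the dm-log-writes crash-consistency tests:

```shell
# In an xfstests checkout, with test/scratch devices you can destroy:
cd xfstests-dev
export FSTYP=btrfs
export TEST_DEV=/dev/vdb  TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc  SCRATCH_MNT=/mnt/scratch

# generic/482 records writes with dm-log-writes, then replays the log
# and verifies the filesystem is consistent at every flush/FUA point --
# i.e. exactly the write-ordering guarantee discussed above.
./check generic/482
```

If a drive's firmware reorders writes across a flush, tests like this are how it gets caught.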
Even in case of such a firmware bug, Btrfs can sometimes recover by
mounting with:
mount -o usebackuproot
mount -o rescue=usebackuproot
(same thing)
This picks an older root to mount instead of the one the super says
should be the most recent. But this still implies the drive firmware
did something wrong.
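A concrete invocation might look like this (/dev/sdXn is a placeholder for the btrfs partition; adjust to your system):

```shell
# Mount read-only first to look around without risking further writes;
# rescue=usebackuproot asks btrfs to fall back to an older tree root.
sudo mount -o ro,rescue=usebackuproot /dev/sdXn /mnt

# On older kernels the equivalent spelling is:
sudo mount -o ro,usebackuproot /dev/sdXn /mnt
```

Anything committed after the backup root was written is lost, but that beats an unmountable filesystem.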
btrfs scrub checks integrity: it compares the data and metadata blocks
with the checksum for each block; this can only be done with the file
system mounted.
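In practice that's two commands on the mounted filesystem:

```shell
# Kick off a scrub of everything under / (runs in the background)
sudo btrfs scrub start /

# Check progress and the count of checksum errors found/corrected
sudo btrfs scrub status /
```

On a RAID-1 profile, scrub can also repair a bad copy from the good mirror; on a single device it can only report the errors.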
btrfs check checks the consistency of the file system. It's a
metadata-only check, but it's not just checking that there's a
checksum match; it checks whether the metadata is actually correct.
The file system needs to be unmounted.
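For example (again, /dev/sdXn is a placeholder, and the filesystem must be unmounted first):

```shell
# Read-only consistency check; btrfs check does not modify anything
# by default. Do NOT add --repair without expert advice -- on a badly
# damaged filesystem it can make things worse.
sudo btrfs check /dev/sdXn
```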
There are also the write-time and read-time tree checkers. Not
everything is included in these checks, but they do catch certain
kinds of corruption at either read time (it's already happened and is
on disk, so let's stop here and not make it worse) or write time (it's
not yet on disk, let's stop here). A common cause of write-time tree
check errors is memory bit flips, but sometimes also kernel bugs and
even btrfs bugs. I guess you could call it a nascent online fsck, but
without repair capability. Currently it flips the file system
read-only to stop further confusion and keep data safe.
Hi Chris,
I can only describe my experiences. On this Lenovo P300 machine, I have
installed btrfs on a consumer drive, 2 enterprise drives, a matching
pair of enterprise drives in btrfs RAID-1, and an SSD. All are
single-ended SATA. I have also replaced the locking short SATA cables
each time, and run the half day of extended BIOS tests to confirm that
there is nothing wrong with the hardware. I have even replaced the
power supply twice, in case there was a problem with flaky voltage.
I literally do not have any more spare hardware to swap in. All of them
have experienced corruption after some weeks of usage. I highly doubt
that the problem is the firmware.
I also have a Lenovo T500 laptop that has experienced the same problem
once, so I also doubt that it is a motherboard issue.
I have not tried to recover using that mount option. Is it documented
somewhere?
I agree that btrfs should have an advantage over ext4 in a fault
scenario. That just has not been my experience. I've found ext4 to be
an order of magnitude more reliable, albeit with a lot more seeking
happening with the same workloads. I'm back running on btrfs again at
this point, and this week it is running as expected. I'm waiting and
watching in trepidation, after about 9 or 10 total reinstalls on my
daily driver machine.
Is it possible that the uncorrected i915 faults are causing a kernel
fault that prevents btrfs from keeping things sane? To try to mitigate
that problem, I've now installed an old AMD cedar card, so that the i915
graphics are not used.
IMHO, coming from a lot of Unix and BSD experience, flipping the root
filesystem to read-only is a very bad thing. Even if it is corrupt, you
need it to recover. It would probably be better to work like ZFS and
remount it using your nifty "usebackuproot" option, and get the machine
up so that you can figure out what was lost instead of keeping it unusable.
--
John Mellor
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure