Re: maybe OT

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Sun, 27 Mar 2022 15:44:50 -0600

On Fri, Mar 18, 2022 at 4:47 PM Paolo Galtieri <pgaltieri@xxxxxxxxx> wrote:
>
> I'm having issues with a VM.
>
> The VM was originally created under VMware and has worked fine for a
> while.  Today when I booted it up instead of seeing the usual MATE login
> screen I get a login prompt:
>
> f34-01-vm:
>
> no matter what I enter, root or pgaltieri as login it never asks for
> password and immediately says login incorrect.  While it's booting I see
> several [FAILED]... messages, e.g. [FAILED] to start CUPS Scheduler
>
> I booted the system again and this time it dropped into emergency mode.
> In emergency mode I see the following messages in dmesg:
>
> BTRFS info (device sda2): flagging fs with big metadata feature
> BTRFS info (device sda2): disk space caching is enabled
> BTRFS info (device sda2): has skinny extents
> BTRFS info (device sda2): start tree-log replay
> BTRFS info (device sda2): parent transid verify failed on 61849600
> wanted 145639 found 145637
> BTRFS info (device sda2): parent transid verify failed on 61849600
> wanted 145639 found 145637
> BTRFS: error (device sda2) in btrfs_replay_log:2423 errno=-5 IO failure
> (Failed to recover log tree)
> BTRFS error (device sda2) open_ctree failed

That's not good. The tree-log is used during fsync as an optimization
to avoid having to do full file system metadata updates. Since the
tree-log exists, we know this file system was undergoing some fsync
write operations which were then interrupted. Either the VM or host
crashed, or one of them was forced to shut down, or there's a bug that
otherwise prevented the guest operations from completing. Further, the
parent transid verification failure messages indicate some out-of-order
writes, as if the virtual drive+controller+cache is occasionally
ignoring flush/FUA requests.
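
To see how far behind the on-disk block is, the wanted/found generation
numbers can be pulled out of the kernel log. This is only a rough sketch:
the helper name transid_gap is made up for illustration, and it assumes
each message sits on a single line in the kernel's usual form, "parent
transid verify failed on <bytenr> wanted <N> found <M>".

```shell
# Hypothetical helper: read kernel log text on stdin and, for each
# "parent transid verify failed" line, print the block address and how many
# generations the block on disk lags behind what its parent expects.
transid_gap() {
    sed -n 's/.*parent transid verify failed on \([0-9][0-9]*\) wanted \([0-9][0-9]*\) found \([0-9][0-9]*\).*/\1 \2 \3/p' |
    while read -r bytenr wanted found; do
        echo "block $bytenr: wanted $wanted, found $found, behind by $((wanted - found))"
    done
}

# Typical use on the affected system:
#   dmesg | transid_gap
```

With the numbers quoted above (wanted 145639, found 145637) the block on
disk is 2 generations older than the tree that points to it.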

I regularly run qemu-kvm VMs with cache mode "unsafe". The VM can crash
all day long and at most I lose ~30s of the most recent writes,
depending on the fsync policy of the application doing the writes. The
file system still mounts normally after such a crash. However, if the
host crashes while the guest is writing, that file system can be
irreparably damaged. This is expected. So you might want to check the
cache policy being used, and make sure that the guest VM is really
shutting down properly before rebooting or shutting down the host.
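
For a libvirt-managed qemu-kvm guest (not VMware, which has its own
settings), one way to check the cache policy is to look at the cache=
attribute on the disk <driver> elements of the domain XML. A minimal
sketch follows; the helper name disk_cache_modes is made up, and it only
reports drivers that set cache= explicitly. A disk with no cache=
attribute uses the hypervisor default and prints nothing here.

```shell
# Hypothetical helper: read libvirt domain XML on stdin and print the value
# of every cache= attribute found on a <driver ...> element, one per line.
disk_cache_modes() {
    sed -n "s/.*<driver [^>]*cache=['\"]\([a-z]*\)['\"].*/\1/p"
}

# Typical use on the host:
#   virsh dumpxml <domain> | disk_cache_modes
```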

>
> I ran btrfs check in emergency mode and it came up with a lot of errors.
>
> How do i recover the partition(s) so I can boot the system, or at least
> mount them?

I'd start with
mount -o ro,nologreplay,rescue=usebackuproot <device> <mountpoint>

If that fails, follow with
mount -o ro,nologreplay,rescue=all <device> <mountpoint>

The second one is a bit of a heavy hammer, but it's safe insofar as
it mounts the fs read-only and makes no changes. It also disables
csum checking, so any corrupt files still get copied out, without
any corruption warnings. See man 5 btrfs to read a bit more about the
other options and vary the selection. This is pretty much a recovery
operation, i.e. get the important data out.
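
The two mount attempts can be wrapped so the gentler option set is tried
first and the heavier one only if that fails. This is just a sketch: the
function name try_rescue_mounts and the MOUNT_CMD override (there so the
flow can be dry-run without root) are made up, and /dev/sda2 and
/mnt/rescue in the example are assumptions based on the device named in
the log.

```shell
# Try the read-only rescue mounts in order, gentlest first, and stop at the
# first one that succeeds. MOUNT_CMD is a hypothetical override for dry runs;
# when unset, the real mount command is used.
try_rescue_mounts() {
    dev=$1
    mnt=$2
    for opts in ro,nologreplay,rescue=usebackuproot ro,nologreplay,rescue=all; do
        if ${MOUNT_CMD:-mount} -o "$opts" "$dev" "$mnt"; then
            echo "mounted $dev on $mnt with -o $opts"
            return 0
        fi
    done
    echo "no rescue mount worked for $dev" >&2
    return 1
}

# Example (as root, mount point must already exist):
#   try_rescue_mounts /dev/sda2 /mnt/rescue
# Then copy whatever matters to a different, healthy disk.
```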

The repair commands for this particular set of errors:

btrfs rescue zero-log <device>
btrfs check --repair --init-extent-tree <device>
btrfs check --repair <device>

I have somewhat low confidence that these will repair the file system
rather than make things worse. So you should start out with the earlier
mount commands to get anything important out of the fs first. If those
don't work and there's important information to get out, you need to
use btrfs restore.
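
btrfs restore works on the unmounted device and copies files to a
directory on some other, healthy file system. A cautious way to run it is
a dry run first (-D only lists what would be recovered), then the real
copy with -i so it keeps going past errors. What follows is a sketch
only: the function name restore_files and the BTRFS_CMD override are made
up, and the device and destination in the example are assumptions.

```shell
# Dry-run btrfs restore first; only if that works, do the real copy.
# -D: dry run (list only), -i: ignore errors and continue, -v: verbose.
# BTRFS_CMD is a hypothetical override so the flow can be checked without
# touching a disk; when unset, the real btrfs command is used.
restore_files() {
    dev=$1
    dest=$2
    ${BTRFS_CMD:-btrfs} restore -D -v "$dev" "$dest" || return 1
    ${BTRFS_CMD:-btrfs} restore -i -v "$dev" "$dest"
}

# Example (as root, destination on a different disk):
#   restore_files /dev/sda2 /mnt/otherdisk/recovered
```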

-- 
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx