On 7/9/20 7:23 PM, Eric Sandeen wrote:
On 7/9/20 4:27 PM, Eric Sandeen wrote:
On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
...
As someone on one of the teams at FB that has to deal with that, I can
assure you all the scenarios you listed can and do happen, and they
happen a lot. While we don't have the "laptop's out of battery" issue
on the production side, we have plenty of power events and unplanned
maintenance events that can and will hit live machines and cut power off.
Force reboots (triggered by either humans or automation) are also not
at all uncommon. Rebuilding machines from scratch isn't free, even with
all the automation and stuff we have, so if power loss or reboot events
on machines using btrfs caused widespread corruption or other issues
I'm confident we'd have found that out pretty early on.
It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs
do not suffer filesystem corruptions and inconsistencies due to reboots
and power losses.
So, for the record, I am in no way insinuating that btrfs is less crash-safe
than other filesystems (though I have not tested that, so if I have time
I'll throw that into the mix as well).
So, we already have those tests in xfstests, and I put btrfs through a few
loops. This is generic/475:
# Copyright (c) 2017 Oracle, Inc. All Rights Reserved.
#
# FS QA Test No. 475
#
# Test log recovery with repeated (simulated) disk failures. We kick
# off fsstress on the scratch fs, then switch out the underlying device
# with dm-error to see what happens when the disk goes down. Having
# taken down the fs in this manner, remount it and repeat. This test
# is a Good Enough (tm) simulation of our internal multipath failure
# testing efforts.
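For anyone who wants to reproduce this, here's a minimal sketch of looping
generic/475 against btrfs from an xfstests checkout; the device paths and
mount points below are just examples and have to match your own setup:

# local.config at the top of the xfstests tree (example devices)
export FSTYP=btrfs
export TEST_DEV=/dev/vdb          # small dedicated device for the test harness
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc       # device generic/475 will wrap in dm-error
export SCRATCH_MNT=/mnt/scratch

# then loop the test (as root) until it fails
i=0
while ./check generic/475; do
    i=$((i + 1))
    echo "generic/475 passed $i time(s)"
done
echo "generic/475 failed after $i clean run(s)"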
It fails within 2 loops. Is it a critical failure? I don't know; the
test looks for unexpected things in dmesg, and perhaps the filter is
wrong. But I see stack traces during the run, and messages like:
[689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted
Yeah, because dm-error throws EIO, and thus we abort the transaction, which
results in an EUCLEAN if you run fsync. This is a scary-sounding message, but
it's _exactly_ what's expected from generic/475. I've been running this in a
loop for an hour and the thing hasn't failed yet. There are all sorts of scary
messages, like:
[17929.939871] BTRFS warning (device dm-13): direct IO failed ino 261 rw 1,34817 sector 0xb8ce0 len 24576 err no 10
[17929.943099] BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
Again, totally expected, because we're forcing EIOs at random times.
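To make "forcing EIOs" concrete: outside of xfstests you can get the same
effect by swapping a device-mapper table from the linear target to the error
target underneath a mounted filesystem. A rough sketch (the device name and
paths are made up for illustration, and this will destroy whatever is on the
backing device):

DEV=/dev/vdc                        # hypothetical backing device
SECTORS=$(blockdev --getsz "$DEV")  # size in 512-byte sectors

# start with a pass-through (linear) mapping and put btrfs on top of it
dmsetup create flaky --table "0 $SECTORS linear $DEV 0"
mkfs.btrfs -f /dev/mapper/flaky
mount /dev/mapper/flaky /mnt/scratch

# ... run I/O against /mnt/scratch (generic/475 drives fsstress here) ...

# swap in the error target: from now on every I/O fails with EIO, which is
# what pushes btrfs into the transaction abort and the messages quoted above
dmsetup suspend flaky
dmsetup load flaky --table "0 $SECTORS error"
dmsetup resume flaky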
So I can't say for sure whether it's a critical failure.
Are btrfs devs using these tests to assess crash/powerloss resiliency
on a regular basis? TBH, I did not expect to see any test failures here,
whether or not they are test artifacts; any filesystem using xfstests
as a benchmark needs to keep things up to date.
It depends on the config options. Some of our transaction abort sites dump
stack, and that trips the dmesg filter, and thus it fails. Generally when I run
this test I turn those options off.
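For those unfamiliar with the mechanics: the dmesg check in xfstests is
essentially a pattern scan over the kernel log produced while the test ran,
so an expected transaction abort that dumps stack still looks like a failure
to it. A very rough approximation of what it looks for (not the real filter,
which also restricts itself to messages logged after a per-test marker and
can apply test-specific filtering):

# crude stand-in for the xfstests dmesg scan, for illustration only
if dmesg | grep -E -q \
        'kernel BUG at|WARNING: CPU:|Call Trace:|general protection fault'; then
    echo "dmesg check failed: stack dump or warning in the kernel log"
fi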
We run this test constantly, specifically because it's the error cases
that get you. But not for crash-consistency reasons, because we're solid there.
I run it to make sure I don't have stupid things like reference leaks or
whatever in the error path. Thanks,
Josef