Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants

On 7/9/20 8:22 PM, Josef Bacik wrote:
> On 7/9/20 7:23 PM, Eric Sandeen wrote:
>> On 7/9/20 4:27 PM, Eric Sandeen wrote:
>>> On 7/9/20 3:32 PM, Davide Cavalca via devel wrote:
>>
>> ...
>>
>>>> As someone on one of the teams at FB that has to deal with that, I can
>>>> assure you all the scenarios you listed can and do happen, and they
>>>> happen a lot. While we don't have the "laptop's out of battery" issue
>>>> on the production side, we have plenty of power events and unplanned
>>>> maintenances that can and will hit live machines and cut power off.
>>>> Force reboots (triggered by either humans or automation) are also not
>>>> at all uncommon. Rebuilding machines from scratch isn't free, even with
>>>> all the automation and stuff we have, so if power loss or reboot events
>>>> on machines using btrfs caused widespread corruption or other issues
>>>> I'm confident we'd have found that out pretty early on.
>>>
>>> It is a bare minimum expectation that filesystems like btrfs, ext4, and xfs
>>> do not suffer filesystem corruptions and inconsistencies due to reboots
>>> and power losses.
>>>
>>> So for the record I am in no way insinuating that btrfs is less crash-safe
>>> than other filesystems (though I have not tested that, so if I have time
>>> I'll throw that into the mix as well.)
>>
>> So, we already have those tests in xfstests, and I put btrfs through a few
>> loops.  This is generic/475:
>>
>> # Copyright (c) 2017 Oracle, Inc.  All Rights Reserved.
>> #
>> # FS QA Test No. 475
>> #
>> # Test log recovery with repeated (simulated) disk failures.  We kick
>> # off fsstress on the scratch fs, then switch out the underlying device
>> # with dm-error to see what happens when the disk goes down.  Having
>> # taken down the fs in this manner, remount it and repeat.  This test
>> # is a Good Enough (tm) simulation of our internal multipath failure
>> # testing efforts.
>>
>> It fails within 2 loops.  Is it a critical failure? I don't know; the
>> test looks for unexpected things in dmesg, and perhaps the filter is
>> wrong.  But I see stack traces during the run, and messages like:
>>
>> [689284.484258] BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted

You might want to change that message, then.  If the filesystem isn't actually corrupted, I'd suggest not printk()ing "Filesystem corrupted" - that's exactly what will make people think it is corrupted. ;)

> 
> Yeah, because dm-error throws EIO, and thus we abort the transaction, which results in an EUCLEAN if you run fsync.  This is a scary-sounding message, but it's _exactly_ what's expected from generic/475.  I've been running this in a loop for an hour and the thing hasn't failed yet.  There's all sorts of scary messages

That's weird.  The test fails very quickly for me - again, AFAICT it fails due to things in dmesg that aren't recognized as safe by the test harness, but a variety of things - not just stack dumps - seem to trigger the failure.
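For anyone who wants to try this at home, the loop is nothing fancy - roughly the following, where the device paths in local.config are placeholders for whatever test/scratch devices you have:

  # local.config for xfstests - device paths below are placeholders
  export FSTYP=btrfs
  export TEST_DEV=/dev/vdb
  export TEST_DIR=/mnt/test
  export SCRATCH_DEV=/dev/vdc
  export SCRATCH_MNT=/mnt/scratch

  # then, from the xfstests checkout, run the test until it trips
  for i in $(seq 1 20); do
      ./check generic/475 || break
  done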

> [17929.939871] BTRFS warning (device dm-13): direct IO failed ino 261 rw 1,34817 sector 0xb8ce0 len 24576 err no 10
> [17929.943099] BTRFS: error (device dm-13) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction)
> 
> again, totally expected because we're forcing EIO's at random times.

Right, of course it will get IO errors, that's why I didn't highlight those in my email.
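For anyone who hasn't looked at the test itself: the "switch out the underlying device with dm-error" step is just a device-mapper table swap to the error target.  Stripped of the test's helper functions, the mechanism looks roughly like this - names and sizes are purely illustrative:

  # build a dm device over the scratch disk (names/sizes illustrative)
  SECTORS=$(blockdev --getsz /dev/vdc)
  dmsetup create scratch-err --table "0 $SECTORS linear /dev/vdc 0"

  # ... mkfs + mount /dev/mapper/scratch-err, kick off fsstress ...

  # "pull the disk": swap in the error target so every further I/O gets EIO
  dmsetup suspend scratch-err
  dmsetup load scratch-err --table "0 $SECTORS error"
  dmsetup resume scratch-err

so EIO on every outstanding and subsequent write is the whole point of the exercise.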

>> so I can't say for sure.
>>
>> Are btrfs devs using these tests to assess crash/powerloss resiliency
>> on a regular basis?  TBH I honestly did not expect to see any test
>> failures here, whether or not they are test artifacts; any filesystem
>> using xfstests as a benchmark needs to be keeping things up to date.
>>
> 
> It depends on the config options.  Some of our transaction abort sites dump stack, and that trips the dmesg filter, and thus it fails.  Generally when I run this test I turn those options off.

It would be good, in general, to fix up the test for btrfs so that it does not yield false positives, if that's what this is.  Otherwise it trains people to ignore it or not run it....
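Even something as simple as whitelisting the error-injection noise we know generic/475 produces before the dmesg scan would go a long way.  Roughly this shape - illustrative only, not the actual harness hook:

  # illustrative only, not the real xfstests filter: drop the messages we
  # know the injected EIOs produce, then flag anything that's left over
  expected='errno=-5 IO failure|errno=-117 Filesystem corrupted|direct IO failed'
  if dmesg | grep -Ev "$expected" | grep -Eq 'Call Trace|BUG:|WARNING:'; then
      echo "unexpected messages in dmesg"
  fi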

> This test is run constantly by us, specifically because it's the error cases that get you.  But not for crash consistency reasons, because we're solid there.  I run them to make sure I don't have stupid things like reference leaks or whatever in the error path.  Thanks,

or "corrupted!" printk()s that terrify the hapless user? ;)

-Eric



