Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants

----- Original Message -----
> From: "Josef Bacik" <josef@xxxxxxxxxxxxxx>
> To: devel@xxxxxxxxxxxxxxxxxxxxxxx
> Sent: Thursday, July 9, 2020 9:11:07 PM
> Subject: Re: Fedora 33 System-Wide Change proposal: Make btrfs the default file system for desktop variants
> 
> On 7/9/20 1:51 PM, Eric Sandeen wrote:
> > On 7/6/20 12:07 AM, Chris Murphy wrote:
> >> On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen <sandeen@xxxxxxxxxx>
> >> wrote:
> >>>
> >>> On 7/3/20 1:41 PM, Chris Murphy wrote:
> >>>> SSDs can fail in weird ways. Some spew garbage as they're
> >>>> failing, some go read-only. I've seen both. I don't have stats on
> >>>> how common it is for an SSD to go read-only as it fails, but once
> >>>> it happens you cannot fsck it. It won't accept writes. If it
> >>>> won't mount, your only chance to recover data is some kind of
> >>>> offline scrape tool. And Btrfs does have a very very good scrape
> >>>> tool, in terms of its success rate - UX is scary. But that can
> >>>> and will improve.
> >>>
> >>> Ok, you and Josef have both recommended the btrfs restore
> >>> ("scrape") tool as a next recovery step after fsck fails, and I
> >>> figured we should check that out, to see if that alleviates the
> >>> concerns about recoverability of user data in the face of
> >>> corruption.
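
(For readers who haven't used it: "scraping" with btrfs restore means
copying whatever file data is still reachable out of the unmountable
filesystem into scratch space on a healthy disk, without ever writing to
the damaged device. A minimal sketch; the device and destination paths
here are illustrative, not from the test below:)

# Minimal sketch of an offline "scrape" with btrfs restore; the device
# and destination paths are illustrative only.
import subprocess

device = "/dev/sda3"     # the filesystem that no longer mounts
dest = "/mnt/rescue"     # scratch space on a different, healthy disk

# Copies whatever file data the tool can still reach; it only ever
# reads from the damaged filesystem, never writes to it.
subprocess.run(["btrfs", "restore", device, dest])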
> >>>
> >>> I also realized that mkfs of an image isn't representative of an
> >>> SSD system typical of Fedora laptops, so I added "-m single" to
> >>> mkfs, because this will be the mkfs.btrfs default on SSDs (right?).
> >>> Based on Josef's description of fsck's algorithm of throwing away
> >>> any block with a bad CRC this seemed worth testing.
> >>>
> >>> I also turned fuzzing /down/ to hitting 2048 bytes out of the 1G
> >>> image, or a bit less than 1% of the filesystem blocks, at random.
> >>> This is 1/4 the fuzzing rate from the original test.
> >>>
> >>> So: -m single, fuzz 2048 bytes of 1G image, run btrfsck --repair,
> >>> mount, mount w/ recovery, and then restore ("scrape") if all that
> >>> fails, see what we get.
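
(For anyone who wants to reproduce this: the loop looks roughly like the
sketch below. This is my reconstruction in Python, not Eric's actual
script; the image path, mount point, and file-population step are
assumptions. As a sanity check on the rate: 2048 random single-byte hits
can touch at most 2048 of the 262,144 4 KiB blocks in a 1 GiB image,
i.e. about 0.8%, which matches "a bit less than 1%".)

#!/usr/bin/env python3
# Reconstruction of the fuzz test described above; paths and the
# file-population step are my guesses, not Eric's actual script.
import os
import random
import subprocess

IMG = "btrfs.img"
SIZE = 1 << 30      # 1 GiB image
NBYTES = 2048       # bytes corrupted per run, < 1% of the 4 KiB blocks

def run(*cmd):
    return subprocess.run(cmd, capture_output=True).returncode

fsck_fail = mount_fail = 0
for i in range(50):
    with open(IMG, "wb") as f:
        f.truncate(SIZE)
    # Single (non-duplicated) metadata, the mkfs.btrfs behavior on SSDs.
    run("mkfs.btrfs", "-f", "-m", "single", IMG)
    # ... populate the filesystem with test files via a loop mount here ...

    # Scribble on 2048 random bytes anywhere in the image.
    with open(IMG, "r+b") as f:
        for _ in range(NBYTES):
            f.seek(random.randrange(SIZE))
            f.write(bytes([random.randrange(256)]))

    if run("btrfsck", "--repair", IMG) != 0:
        fsck_fail += 1
    # 'usebackuproot' is the newer name for the old '-o recovery' option.
    if run("mount", "-o", "loop", IMG, "/mnt") != 0 and \
       run("mount", "-o", "loop,usebackuproot", IMG, "/mnt") != 0:
        mount_fail += 1
        # Last resort: scrape whatever is left into a scratch directory.
        dest = f"/tmp/restore.{i}"
        os.makedirs(dest, exist_ok=True)
        run("btrfs", "restore", IMG, dest)
    else:
        run("umount", "/mnt")

print(f"{fsck_fail} btrfsck failures, {mount_fail} mount failures")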
> >>
> >> What's the probability of this kind of corruption occurring in the
> >> real world? If the probability is so low it can't practically be
> >> computed, how do we assess the risk? And if we can't assess risk,
> >> what's the basis of concern?
> > 
> > From 20 years of filesystem development experience, I know that people
> > run filesystem repair tools.  It's just a fact.  For a wide variety of
> > reasons - from bugs, to hardware errors, to admin errors, you name it,
> > filesystems experience corruption and inconsistencies.  At that point
> > the administrator needs a path forward.
> > 
> > "people won't need to repair btrfs" is, IMHO, the position that needs
> > to be supported, not "filesystem repair tools should be robust."
> > 
> >>> I ran 50 loops, and got:
> >>>
> >>> 46 btrfsck failures
> >>> 20 mount failures
> >>>
> >>> So it ran btrfs restore 20 times; of those, 11 runs lost all or
> >>> substantially all of the files; 17 runs lost at least 1/3 of the
> >>> files.
> >>
> >> Josef states reliability of ext4, xfs, and Btrfs are in the same
> >> ballpark. He also reports one case in 10 years in which he failed to
> >> recover anything. How do you square that with 11 complete failures,
> >> trivially produced? Is there even a reason to suspect there's
> >> residual risk?
> > 
> > Extrapolating from Facebook's use cases to the Fedora desktop should
> > be approached with caution, IMHO.
> > 
> > I've provided evidence that if/when damage happens for whatever reason,
> > btrfs is unable to recover in place far more often than other filesystems.
> > 
> >> When metadata is single profile, Btrfs is basically an early warning
> >> system. The available research on uncorrectable errors, errors that
> >> drive ECC does not catch, suggests that users are decently likely to
> >> experience at least one block of corruption in the life of the drive.
> >> And that it tends to get worse up until drive failure. But there is
> >> much less chance to detect this if the file system isn't also
> >> checksumming the vastly larger payload on a drive: the data.
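
(The mechanism is simple enough to show in a few lines: a checksum
recorded at write time stops matching after a single silent bitflip, so
a read returns an error, an early warning, instead of bad data. A toy
illustration using plain CRC-32 from the Python stdlib; btrfs defaults
to CRC32C, but the principle is the same:)

# Toy model of read-time checksum verification; btrfs stores a checksum
# per data block at write time and verifies it on every read.
import zlib

block = bytes(4096)                  # a 4 KiB data block
stored = zlib.crc32(block)           # checksum recorded at write time

flipped = bytearray(block)
flipped[1000] ^= 0x01                # one silent bitflip on the media

# On read, the recomputed checksum no longer matches, so the filesystem
# can return EIO instead of silently handing back bad data.
assert zlib.crc32(bytes(flipped)) != stored
print("checksum mismatch detected")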
> > 
> > One of the problems in this whole discussion is the assumption that
> > filesystem inconsistencies only arise from disk bitflips etc; that's
> > just not the case.
> > 
> > Look, I'm just providing evidence of what I've found when re-evaluating the
> > btrfs administration/repair tools.  I've found them to be quite weak.
> > 
> > From what I've gathered from these responses, btrfs is unique in that
> > it is /expected/ that if anything goes wrong, the administrator should
> > be prepared to scrape out remaining data, re-mkfs, and start over.  If
> > that's acceptable for the Fedora desktop, that's fine, but I consider
> > it a risk that should not be ignored when evaluating this proposal.
> > 
> 
> Agreed, it's the very first thing I said when I was asked what the
> downsides are.  There's clearly more work to be done in the recovery
> arena.  How often do disks fail for Fedora?  Do we have that data?  Is
> this a real risk?  Nobody can say, because Fedora doesn't have data.
We semi-regularly see installer bugs that turn out to be storage hardware
failures (attached journal full of IO errors), so these do happen.
Unfortunately, I'm afraid there is no easy way to get a count of these
specific bugs over time...


> 
> Facebook does, however, have that data, and it's a microscopically small
> percentage.  I agree that Facebook is vastly different from Fedora from
> a recovery standpoint, but I think our workloads and hardware
> extrapolate to the normal Fedora user quite well.  We drive the disks
> harder than the normal Fedora user does, of course, but in the end we're
> updating packages, taking snapshots, and building code.  We're just
> doing it at 1000x what a normal Fedora user does.
>   Thanks,
> 
> Josef
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx



