On Mon, 6 Jul 2020 at 01:19, Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Jul 3, 2020 at 8:40 PM Eric Sandeen <sandeen@xxxxxxxxxx> wrote:
> >
> > On 7/3/20 1:41 PM, Chris Murphy wrote:
> > > SSDs can fail in weird ways. Some spew garbage as they're failing,
> > > some go read-only. I've seen both. I don't have stats on how common
> > > it is for an SSD to go read-only as it fails, but once it happens
> > > you cannot fsck it. It won't accept writes. If it won't mount, your
> > > only chance to recover data is some kind of offline scrape tool.
> > > And Btrfs does have a very, very good scrape tool in terms of its
> > > success rate; the UX is scary, but that can and will improve.
> >
> > Ok, you and Josef have both recommended the btrfs restore ("scrape")
> > tool as a next recovery step after fsck fails, and I figured we should
> > check that out to see whether it alleviates the concerns about
> > recoverability of user data in the face of corruption.
> >
> > I also realized that mkfs of an image isn't representative of the SSD
> > systems typical of Fedora laptops, so I added "-m single" to mkfs,
> > because this will be the mkfs.btrfs default on SSDs (right?). Based
> > on Josef's description of fsck's algorithm (throwing away any block
> > with a bad CRC), this seemed worth testing.
> >
> > I also turned fuzzing /down/ to hitting 2048 bytes of the 1G image at
> > random, or a bit less than 1% of the filesystem blocks. This is 1/4
> > the fuzzing rate of the original test.
> >
> > So: -m single, fuzz 2048 bytes of the 1G image, run btrfsck --repair,
> > mount, mount w/ recovery, and then restore ("scrape") if all of that
> > fails, and see what we get.
>
> What's the probability of this kind of corruption occurring in the
> real world? If the probability is so low it can't practically be
> computed, how do we assess the risk? And if we can't assess risk,
> what's the basis of concern?

Aren't most disk failure tests of the form "huh, it somehow happened at
least once, and I think this explains all these other failures too"?

I know that with giant clusters you can do more testing, but you also
run into questions like:

What is the chance that a disk will die over time? 100%.
What is the chance that a disk died from this particular scenario?
0.00000<maybe put a digit here>%.
Reword the question slightly differently: what is the chance that this
disk died from that scenario? 100%.

For the HPC computers we had a score of PhD statisticians coming up
with all kinds of papers on disk failure modes which, when the question
was asked one way, gave practically 0% odds of a given failure ever
happening. However, every one of those failures had happened at least
once over some time frame... sometimes a short one, sometimes a long
one, and sometimes so often that someone had to retract a paper,
because it was clear that while the math said it shouldn't happen, it
did in real life. <Welcome to HPC at high altitude: cosmic rays, low
air pressure, and dry air all need to be factored in.>

--
Stephen J Smoogen.
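
For concreteness, here is a rough sketch of the fuzz-and-recover
sequence Eric describes above. The Python driver, file paths, and the
single-byte-overwrite fuzzing are illustrative assumptions, not the
actual harness used for the test; the btrfs commands themselves
(mkfs.btrfs -m single, btrfsck --repair, mount w/ recovery, btrfs
restore) are the ones named in the quoted mail.

    #!/usr/bin/env python3
    """Sketch of the fuzz-and-recover sequence described above.

    Paths, the populate step, and the Python driver are assumptions.
    Needs root for mkfs/mount and a scratch directory to play in.
    """
    import os
    import random
    import subprocess

    IMG = "btrfs-test.img"
    MNT = "/mnt/btrfs-fuzz"
    RESTORE_DIR = "/tmp/btrfs-restore"
    IMG_SIZE = 1 << 30      # 1 GiB image, as in the test above
    FUZZ_BYTES = 2048       # single-byte flips; if each lands in a distinct
                            # 4 KiB block, that's just under 1% of the blocks

    def run(*cmd):
        """Run a command and report success/failure instead of raising."""
        return subprocess.run(cmd, check=False).returncode == 0

    # 1. Create a 1 GiB image and format it with single metadata,
    #    the expected mkfs.btrfs default on SSDs.
    with open(IMG, "wb") as f:
        f.truncate(IMG_SIZE)
    run("mkfs.btrfs", "-f", "-m", "single", IMG)
    # ... mount and populate the filesystem with test data here ...

    # 2. Fuzz: overwrite FUZZ_BYTES randomly chosen byte offsets.
    with open(IMG, "r+b") as f:
        for _ in range(FUZZ_BYTES):
            f.seek(random.randrange(IMG_SIZE))
            f.write(bytes([random.randrange(256)]))

    # 3. Repair, mount, mount with the recovery option, then fall back
    #    to the offline scrape tool (btrfs restore).
    os.makedirs(MNT, exist_ok=True)
    os.makedirs(RESTORE_DIR, exist_ok=True)
    run("btrfsck", "--repair", IMG)
    if run("mount", "-o", "loop", IMG, MNT):
        print("mounted after repair")
    elif run("mount", "-o", "loop,recovery", IMG, MNT):
        print("mounted with recovery option")
    elif run("btrfs", "restore", IMG, RESTORE_DIR):
        print("scraped some data with btrfs restore")
    else:
        print("all recovery paths failed")

Running something like this repeatedly over fresh images would give a
rough per-path success rate rather than a single pass/fail result.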
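
On the risk-assessment question above: the "practically 0%" answer and
the "it happened anyway" answer are both correct, because they answer
different questions; a tiny per-disk probability compounds over fleet
size and time. A toy calculation, with numbers made up purely for
illustration:

    # Toy illustration: a failure mode with "practically 0%" odds per
    # disk-year still shows up once there are enough disks and years.
    # All numbers below are made up purely for illustration.
    p = 1e-7          # assumed probability of this failure mode per disk-year
    disks = 100_000   # fleet size
    years = 5

    expected = p * disks * years                    # expected occurrences
    at_least_once = 1 - (1 - p) ** (disks * years)  # P(seen at least once)
    print(f"expected occurrences: {expected:.3f}")
    print(f"P(at least one occurrence): {at_least_once:.1%}")
    # -> expected occurrences: 0.050, P(at least one occurrence): ~4.9%

Which is roughly the HPC experience described above: failure modes that
are negligible per disk still turn up somewhere in the fleet.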