On Thu, Apr 1, 2021 at 5:53 AM Patrick O'Callaghan <pocallaghan@xxxxxxxxx> wrote:
>
> On Wed, 2021-03-31 at 18:00 -0600, Chris Murphy wrote:
> > Nothing to add but the usual caveats:
> > https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
>
> That's pretty scary, though the drives I'm using are 1TB units
> scavenged from my extinct NAS so are unlikely to be SMR. They're both
> WD model WD10EZEX drives.

It's not an SMR concern. It's making sure the drive gives up on errors
faster than the kernel tries to reset what it thinks is a hanging
drive.

smartctl -l scterc /dev/sdX

That'll tell you the current setting. I'm pretty sure Blues come with
SCT ERC disabled. Some support it, some don't. If it's supported,
you'll want to set it to something like 70-100 deciseconds, the unit
SATA drives use for this feature. (Rough example commands are at the
end of this message.)

And yeah, the linux-raid@ list is chock full of such
misconfigurations. It filters out all the lucky people, and the
unlucky people end up on the list with a big problem which generally
looks like this: one dead drive, and one of the surviving drives with
a bad sector that was never fixed up through the normal raid
bad-sector recovery mechanism, because the kernel's default is to be
impatient and do a link reset on consumer drives that overthink a
simple problem. Upon link reset, the entire command queue in the drive
is lost, so there's no way to know which sector it was hanging on, and
no way for raid to do a fixup.

The fixup mechanism is this: the drive reports an uncorrectable read
error with a sector address *only once it gives up*. Then md raid (and
btrfs and zfs) can look up that sector, find out what data is on it,
find its mirror, read the good data, and overwrite the bad sector with
good data. The overwrite is what fixes the problem.

If the drive doesn't support SCT ERC, we have to get the kernel to be
more patient instead. That's done via sysfs (also sketched at the end
of this message).

> > I use udev for that instead of init scripts. Concept is the same
> > though, you want SCT ERC time to be shorter than kernel's command
> > timer.
>
> I've been using MD for a while and haven't seen any errors so far.

And you may never see it. Or you may end up being the unlucky person
whose raid experiences complete loss of the array. When I say this
comes up all the time on the linux-raid@ list, it's about once every
couple of weeks. It's seen most often with raid5 because it has more
drives, and thus more failures, than raid1 setups, and it tolerates
only one failure *in a stripe*. Most everyone thinks of a failure as a
complete drive failure, but drives also partially fail. The odds of
two drives partially failing in the same stripe are pretty
astronomical, but if one drive dies and *any* of the remaining drives
has a bad sector that can't be read, the entire stripe is lost. And
depending on what's in that stripe, that can bring down the array.

So what you want is for the drives to report their errors, rather than
the kernel doing link resets.

--
Chris Murphy
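
Here's roughly what I mean by checking and setting SCT ERC. sdX is a
placeholder for your drive; double check the smartctl man page before
copying anything, and note the setting is lost at power-off on most
drives, so it has to be reapplied at every boot:

smartctl -l scterc /dev/sdX          # show the current read/write ERC timers
smartctl -l scterc,70,70 /dev/sdX    # set both timers to 70 deciseconds (7 s)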
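
If a drive doesn't support SCT ERC, the sysfs knob I'm referring to is
the kernel's SCSI command timer, which defaults to 30 seconds. The
raid wiki page above suggests raising it to something like 180 seconds
on such drives:

cat /sys/block/sdX/device/timeout         # current value, in seconds
echo 180 > /sys/block/sdX/device/timeout  # as root; not persistent across reboots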
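
And a bare-bones sketch of the udev approach, untested as written: the
rules file name is made up, the smartctl path may differ on your
system, and ideally each rule would only match the drives that
actually need it:

# /etc/udev/rules.d/60-sct-erc.rules (hypothetical file name)
# drives that support SCT ERC: tell them to give up after 7 seconds
ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"
# drives that don't support it: make the kernel more patient instead
#ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", ATTR{device/timeout}="180"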