On Sun, Dec 22, 2019 at 12:52 AM Javier Perez <pepebuho@xxxxxxxxx> wrote:
>
> Hi
> My home partition is on a 2T HDD using btrfs
>
> I am reading the material at
> http://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> but still I am not that clear on some items.
>
> If I want to add a second 2T drive to work as a mirror (RAID1), it looks
> like I do not have to invoke mdadm or anything similar; it seems like
> btrfs will handle it all internally. Am I understanding this right?

Correct.

> Also, before I add a new device, do I have to partition the drive, or
> does btrfs take over all these duties (partitioning, formatting) when it
> adds the device to the filesystem?

Partitioning is optional. Drives I dedicate to one task only, I do not
partition. If I use them for other things, or might use them for other
things, then I partition them.

The add command formats the new device and resizes the file system:

# btrfs device add /dev/sdX /mountpoint

The balance command with a convert filter changes the profile of the
specified block groups, and does the replication:

# btrfs balance start -dconvert=raid1 -mconvert=raid1 /mountpoint

> What has been the experience like with such a system?

Gotcha 1 (this applies to mdadm and LVM RAID as well as Btrfs): a mismatch
between the drive's SCT ERC and the kernel's SCSI block command timer is
really common. That is, there is a drive error timeout and a kernel block
device error timeout, and the drive's timeout must be shorter than the
kernel's. Otherwise, the information needed for self-healing is lost, bad
sectors accumulate, and eventually there is data loss.

The thing is, the defaults are often wrong: consumer hard drives often have
very long SCT ERC - typically it's disabled entirely - making for really
impressive timeouts in excess of 1 minute (some suggest 2 or 3 minutes),
whereas the kernel command timeout is 30 seconds.

Ideally, use 'smartctl -l scterc' to set the SCT ERC to something like 7
seconds. This can also be set with a udev rule pointed at the device by-id,
using its serial number or wwn (a rough sketch of the commands is below).
You want the drive firmware to give up on read errors quickly; that way it
reports the bad sector's LBA to the kernel, which in turn can find a good
copy (raid1, 5, 6 or DUP profiles on Btrfs) and overwrite the bad sector,
thereby fixing it.

If the drive doesn't support SCT ERC, then you'll need to increase the
kernel's command timer instead. This is a kernel setting, but it is per
block device. Raise the value to something fairly incredible, like 180
seconds. Worst case, a marginally bad sector then results in maybe a 3
minute hang until the drive gives up and reports a read error - and then
it gets fixed up.

It seems esoteric, but really it's pernicious, and it's common in the data
loss cases reported on linux-raid@, where they have the most experience
with RAID. It applies just the same to Btrfs. More info here:
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

Gotcha 2, 3, 4: Device failures mean multiple gotchas all at once, so you
need a plan for how to deal with them, so you aren't freaking out if it
happens. Panic often leads to user-induced data loss. If in doubt, you are
best off doing nothing and asking. Both the linux-btrfs@ list and #btrfs
on IRC freenode.net are approachable for this.

Gotcha: If a device dies, you're not likely to see any indication of
failure unless you're looking at kernel messages and see a ton of Btrfs
complaints. Like, several scary red warnings *per* lost write.
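Putting Gotcha 1 into concrete commands, it looks roughly like this - a
sketch only, where /dev/sdX, the example serial number, and the rules file
name are placeholders to adapt to your own drives.

Check whether the drive supports SCT ERC, and what it's currently set to:

# smartctl -l scterc /dev/sdX

If it's supported, set it to 7 seconds (the value is in tenths of a second):

# smartctl -l scterc,70,70 /dev/sdX

That setting usually doesn't survive a power cycle, so a udev rule keyed to
the drive's serial number can reapply it at boot, something like this in
/etc/udev/rules.d/99-scterc.rules:

ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL_SHORT}=="EXAMPLE123SERIAL", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"

If the drive doesn't support SCT ERC at all, raise the kernel's command
timer for that device instead (the value is in seconds, default 30, and it
also doesn't persist across reboots):

# echo 180 > /sys/block/sdX/device/timeout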
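Since those warnings are easy to miss unless you happen to be watching the
journal, it's worth glancing at the kernel log and the per-device error
counters once in a while:

# journalctl -k | grep -i btrfs
# btrfs device stats /home

The stats command reports write, read, flush, corruption and generation
error counters for each device; anything nonzero deserves a closer look
('btrfs device stats -c' exits nonzero in that case, which is handy in a
script).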
If a drive dies, the kernel log will quickly fill with thousands of those
warnings. Whether you do or don't notice this, the next time you reboot...

Gotcha: By default, Btrfs fails to mount if it can't find all devices.
This is because there are consequences to degraded operation, and it
requires user interaction to make sure it's all resolved. But because such
mounts fail, there's a udev rule that waits for all Btrfs member devices,
so that small delays between multiple devices appearing don't result in
failed mounts. Near as I can tell, there's no timeout for this udev rule.
This is the rule:

/usr/lib/udev/rules.d/64-btrfs.rules

So now you're stuck in this startup hang. If it's just a case of the
device accidentally being missing, it's safe to reconnect it, and then
startup will proceed normally. Otherwise, you need a way to get unstuck.

I'm improvising here, but what you want to do is remove the suspect drive,
(temporarily) disable this udev rule so that the system *will* try to
mount /home, and also change fstab to add the "degraded" mount option so
that the mount attempt won't fail. Now at least you can boot and work
while degraded until you get a chance to really fix the problem. A
degraded /home isn't any more risky than a single-device /home - the
consequences really are all in making sure it's put back together
correctly.

OK, so how to do all that? Either boot off a Live CD, inhibit the udev
rule, and change fstab; or boot your system with rd.break=cmdline, mount
the root file system at /sysroot, and make these changes there.

Before rebooting, use 'btrfs filesystem show' to identify which drive
Btrfs thinks is missing or bad, and remove it. To replace it you can use
'btrfs replace', or 'btrfs dev add' followed by 'btrfs dev rem missing'.
The first is preferred, but read the man pages on both methods so you're
aware of whether or not you need to do a file system resize. Then use
'btrfs fi us /mountpoint' to check usage for any block groups that are not
raid1: during degraded writes it's possible some single-copy data block
groups are created, and those need to be manually converted to raid1 (yes,
you can have mixed replication levels on Btrfs). And if some degraded
writes happened and you then get the missing device reconnected, you'll
use 'btrfs scrub' to replicate those degraded writes to the formerly
missing device. That's not automatic either.

A couple more gotchas to be aware of, which might be less bad with the
latest kernels, but without testing for it I wouldn't assume they're
fixed:

https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volumes_only_mountable_once_RW_if_degraded
https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices

Otherwise, Btrfs raid1 is stable on stable hardware. It automatically
self-heals if it finds problems during normal operation, and it also heals
during scrubs. The gotchas only start if there's some kind of problem, and
then the challenge is to understand the exact nature of the problem before
taking action. It's the same with mdadm and LVM raids - just different
gotchas and commands.

--
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx