On Mon, Dec 28, 2020 at 2:33 AM François Patte
<francois.patte@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> Bonjour,
>
> I try to build a home nas to make a dlna server for audio, video and
> pictures.
>
> I have 2 disks for the data which I want to be mounted in raid1 (software).
>
> I formated the two disks using btrfs
> (mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/sdb)
>
> It works but, up to now, I can't see the advantages of this file system
> vs ext4 managed by mdadm.
>
> One disadvantage is that it seems that monitoring the system is not
> possible in case of disk failure for instance.

Btrfs in a raid1 configuration is significantly different from either mdadm raid or single-disk Btrfs. It will self-heal, unambiguously, both metadata and data. It detects corruption, including bit rot, torn writes, and misdirected writes, even when the drive doesn't report any error. It finds the good copy and fixes up the bad copy. This happens passively during normal use. The same repair principle applies when scrubbing: a scrub reads all metadata and data, but not unused areas.

mdadm raid depends exclusively on drive-reported errors; it has no independent means of knowing which copy of a block is valid, because it has no integrity checking. Even when ext4/xfs metadata checksumming detects a checksum error in its own metadata, it doesn't know which drive contains the correct copy, and neither does mdadm.

Intentionally mismatching drive make/model actually results in a more reliable setup on Btrfs, because any firmware bugs in either drive are isolated. Any bug resulting in corruption on one drive gets fixed by Btrfs from the (meta)data on the other drive. With a regular scrub you have less of a chance of getting bitten by such defects.

Depending on the size of the drives and how much data is on them, 'btrfs replace' can be quite a lot faster when replacing a failed drive. It uses a variation on scrub to replicate (meta)data onto a new device. I definitely recommend 'btrfs replace' as the go-to for replacing drives, rather than 'btrfs device add' followed by 'btrfs device delete'. Likewise, it will do fixups as problems are encountered, as long as there's a good copy.

Btrfs also won't kick a drive out of a pool when it misbehaves. Kicking a drive out means any partial redundancy it could still provide is lost. Since Btrfs can unambiguously determine whether reads from a drive are corrupt, it's in a position to keep using the drive and handle the errors. There is also an option for 'btrfs replace' to read from the drive being replaced only if there are no other good copies of the (meta)data on other drives, which makes the replacement go faster.

Bit rot in anything that's already compressed, like audio, video, and images, tends to cause significantly more damage than a mere bit or byte flip would otherwise do. Detecting this and preventing corruption from replicating is a significant feature of Btrfs.

Also, the ability to add drives and grow the array is probably more straightforward and more tolerant of different sized drives. I don't mean users will necessarily avoid confusion, but the file system itself can handle it if you add oddly sized drives one after another. 'btrfs device add' implies the mkfs and resize steps, and it'll attempt to balance allocations based on the drives with the most free space available. It is certainly possible to get confused if you don't add two equally sized drives at a time.
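To make that concrete, here's roughly what the day-to-day commands look like. I'm assuming the file system is mounted at /srv/nas and using placeholder device names, so adjust for your setup:

  # per-device error counters btrfs keeps (read, write, flush, corruption, generation)
  btrfs device stats /srv/nas

  # verify every copy of metadata and data; bad copies get repaired from the good mirror
  btrfs scrub start /srv/nas
  btrfs scrub status /srv/nas

  # replace a failing /dev/sda with a new /dev/sdc in one step;
  # -r only reads from the outgoing drive if no other good copy exists
  btrfs replace start -r /dev/sda /dev/sdc /srv/nas
  btrfs replace status /srv/nas

  # grow the pool; the add implies the mkfs and resize steps
  btrfs device add /dev/sdd /srv/nas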
As for monitoring, nagios check_btrfs might do what you want:
https://github.com/knorrie/python-btrfs/blob/master/examples/nagios/plugins/check_btrfs

There is a rather pernicious problem to be aware of when using consumer drives on Linux, one that affects mdadm, lvm, Btrfs, and (I assume) ZFS raids. And that's this esoteric annoyance of timeout mismatches:
https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

The gist of it is that the drive firmware's command timeout needs to happen before the kernel's. The typical point of confusion is that the kernel's command timer looks like a device timer, because it's a per block device setting in sysfs. The ideal scenario is to leave the kernel's timer alone and instead use 'smartctl -l scterc,70,70', via something like /dev/disk/by-id in a udev rule, to tell the drive to give up on errors quickly. 70 deciseconds is typical; all drives use deciseconds.

If the drive has no configurable SCT ERC, then you have to change the kernel's timeout. If you don't, the kernel decides the drive isn't responding and does a link reset, and now the whole command queue is lost and we have no idea why the drive wasn't responding. I figure there's a greater than 90% chance it's a bad sector and the drive is intentionally not responding because it's in "deep recovery", if it's a consumer drive. I know. I know. Sounds like bat guano.

There is also a dracut bug that could cause some confusion if the drives aren't both available at mount time. The btrfs udev rule causes a wait for all btrfs devices belonging to a particular fs UUID to appear before systemd will attempt to mount it, to prevent mount failure. This normally only affects btrfs multiple device volumes used as the system root, but if you have many different devices, possibly on different controllers, set to automount in fstab, it could be an issue:
https://github.com/dracutdevs/dracut/issues/947

--
Chris Murphy
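P.S. A rough sketch of the SCT ERC workaround, in case it helps. The rule file name is arbitrary, and I'm matching every sd device here rather than a specific /dev/disk/by-id entry just to keep it short:

  # /etc/udev/rules.d/60-scterc.rules
  # tell each drive to give up on a bad sector after 7 seconds, well before
  # the kernel's default 30 second command timer fires
  ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"

For a drive that doesn't support SCT ERC at all, go the other way and raise the kernel's timer instead, per block device (180 seconds is just a comfortably large value, adjust to taste):

  echo 180 > /sys/block/sdX/device/timeout

You can check what a drive currently reports with 'smartctl -l scterc /dev/sdX'.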