Re: btrfs RAID 5?

On Sun, Jan 3, 2021 at 3:09 PM Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
>
> On Sun, Jan 3, 2021 at 3:34 PM Chris Murphy <lists@xxxxxxxxxxxxxxxxx> wrote:
>>
>>
>>
>> On Sun, Jan 3, 2021, 6:26 AM Richard Shaw <hobbes1069@xxxxxxxxx> wrote:
>>>
>>> Chris,
>>>
>>> Thanks for the detailed explanation. Too much to quote for my follow up question :)
>>>
>>> So for 3 drives and my desire to have more capacity and redundancy (for drive failure) would I be better off with RAID1 or RAID5 w/ btrfs?
>>
>>
>> Depends on what you mean by better. :D In terms of survivability of your data? You're better off with more independent copies/backups than you are with any kind of raid. Raid improves availability, i.e. instead of 0% working it has a degraded mode where it's mostly working, but will require specific actions to make it healthy again, before it also becomes 0% working.
>
>
> Yeah, the RAID1 seems a lot easier with the caveat that the free space reporting is bogus, which may be important for a media drive. :) The RAID5 caveats don't scare me too much.

The odd-number-of-devices raid1 free space reporting issue is 'df'
specific. If you try it out and fallocate a bunch of files 10G at a
time (fallocate is fast, there are no actual writes) you can see the
goofy thing that happens in the bug report. It isn't ever 100% wrong,
but it is confusing. The btrfs-specific commands always tell the
truth: 'btrfs fi df' is short and sweet; 'btrfs fi us' is very
information dense.
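
For example, a minimal way to reproduce it, assuming a hypothetical
mount point of /mnt/media:

    # fallocate allocates extents without writing data, so it's fast
    fallocate -l 10G /mnt/media/test1
    fallocate -l 10G /mnt/media/test2

    # 'df' can misreport free space on odd-device-count raid1
    df -h /mnt/media

    # the btrfs-specific reports are accurate
    btrfs filesystem df /mnt/media       # short and sweet
    btrfs filesystem usage /mnt/media    # very information dense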

Metadata raid1 + data raid5 is fine if you're OK with the caveats from
Zygo's email; people have run it for years and it's saved their butts
in other ways. The btrfs write hole is not as bad as other write
holes, because while btrfs raid5/6 does not checksum parity, it will
still spit out a csum error upon reconstructing from bad parity. So
while that's a form of partial data loss, which would be the case with
any raid5 in the same situation, you at least get warned about it and
it doesn't propagate into your backups or user space, etc.
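
For reference, creating that layout is a single mkfs invocation; the
device names here are just placeholders for your three drives:

    # -m sets the metadata profile, -d the data profile
    mkfs.btrfs -m raid1 -d raid5 /dev/sda /dev/sdb /dev/sdc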

xxhash64 is a toss up: it's as fast or faster than crc32c to compute,
with better collision resistance, but the csum algorithm is a
mkfs-time-only option - that's the only reason I mention it. I can
write more upon request.
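
If you do want it, it has to be chosen at mkfs time, e.g. (same
placeholder device names as above; needs btrfs-progs and a kernel from
5.5 or newer, if I remember right):

    mkfs.btrfs --csum xxhash64 -m raid1 -d raid5 /dev/sda /dev/sdb /dev/sdc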

If these are consumer drives: (a) fix the timeout mismatch, and (b)
disable each drive's write cache. This is not btrfs-specific advice;
it applies to mdadm and LVM raid as well. Maybe someone has udev rules
for this somewhere, and if not we ought to get them into Fedora
somehow. 'hdparm -W' is the command; lowercase '-w' is dangerous (!).
It is OK to use the write cache if it's determined that the drive
firmware honors flush/FUA, which they usually do, but the penalty if
they don't and you get a crash is so bad that it's maybe not worth
taking the risk. Btrfs raid1 metadata helps here, as does using
different drive makes/models, because if one drive does the wrong
thing, btrfs self-heals from another drive even passively - but for
parity raid you really ought to scrub following a crash. Or hey, just
avoid crashes :)

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch
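
Roughly something like this, with placeholder device names; the SCT
ERC values follow the wiki page above:

    # disable the drive's volatile write cache (uppercase -W!)
    hdparm -W 0 /dev/sda

    # if the drive supports SCT ERC, cap error recovery at 7 seconds
    smartctl -l scterc,70,70 /dev/sda

    # if it doesn't, raise the kernel's command timeout instead
    echo 180 > /sys/block/sda/device/timeout

    # and for parity raid, scrub after any unclean shutdown
    btrfs scrub start /mnt/media

A udev rule to make the write cache setting persistent might look
something like this (a sketch only; the file name and match are
illustrative):

    # /etc/udev/rules.d/60-disable-write-cache.rules
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"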


>> Striped parity raids perform well for sequential workloads. They're not good for metadata heavy workloads. Btrfs alters this calculation because metadata (the fs itself) can have a different profile than data. i.e. the recommendation is to use raid1 metadata when using raid5 data; and raid1c3 metadata when using raid6 data.
>
>
> For a media drive (smallest file is a few MB ogg to 30GB movie) I don't think things will be very metadata heavy.

Yeah, it's fine. The strip size (mdadm calls it chunk, which is an
entirely different thing on btrfs - of course) is 64KiB, so even
smaller files will not have bad performance.


-- 
Chris Murphy
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx


