Re: How do really work RAID1 on btrfs?

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Tue, 8 Dec 2020 16:21:21 -0700

On Tue, Dec 8, 2020 at 12:22 PM Sergio Belkin <sebelk@xxxxxxxxx> wrote:
>
> Hi!
> I've read the explanation about how much space is available using disks with different sizes[1]. I understand the rules, but I see a contradiction with the definition of RAID-1 in btrfs:
>
> «A form of RAID which stores two complete copies of each piece of data. Each copy is stored on a different device. btrfs requires a minimum of two devices to use RAID-1. This is the default for btrfs's metadata on more than one device.»
>
> So, let's say we have 3 small disks: 4GB, 3GB, and 2GB.

From the btrfs perspective, this is a 9G file system, with raid1
metadata and data block groups. The "raidness" happens at the block
group level, it is not at the device level like mdadm raid.

Deep dive: A block group is a logical range of bytes (variable size,
typically 1G). Which drive a file extent physically lives on, and
where, is a function of the block group to chunk mapping. For example,
a 1G data block group using the raid1 profile physically exists as two
1G chunks, one on each of two different devices. What this means is
that internally Btrfs sees everything as just one copy in a virtual
address space, and it's up to the chunk tree and allocator to handle
the details of exactly where it's located physically and how it's
replicated. It's normal to not totally grok this, it's pretty
esoteric, but if there's one complicated thing to try to get about
Btrfs, it's this. Because once you get it, all the other
unique/unusual/confusing things start to make sense.
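
To make the mapping idea concrete, here is a minimal Python sketch of one
raid1 block group resolving a logical address to its two physical
copies. It is a simplified model, not the on-disk format: the names
Stripe, ChunkMapping and resolve, and all the offsets, are made up for
illustration.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Stripe:
    """One physical chunk: which device, and where on that device (hypothetical model)."""
    devid: int
    physical_offset: int


@dataclass(frozen=True)
class ChunkMapping:
    """One block group: a logical byte range plus the chunks backing it."""
    logical_start: int
    length: int
    profile: str
    stripes: tuple  # for raid1: exactly two Stripe entries on different devices


def resolve(mapping, logical_addr):
    """Return every (devid, physical offset) that holds the byte at logical_addr."""
    end = mapping.logical_start + mapping.length
    if not (mapping.logical_start <= logical_addr < end):
        raise ValueError("address is outside this block group")
    delta = logical_addr - mapping.logical_start
    return [(s.devid, s.physical_offset + delta) for s in mapping.stripes]


# A 1G raid1 data block group backed by one chunk on device 1 and one on device 2.
bg = ChunkMapping(
    logical_start=5 * GIB,
    length=GIB,
    profile="raid1",
    stripes=(Stripe(devid=1, physical_offset=2 * GIB),
             Stripe(devid=2, physical_offset=1 * GIB)),
)

# The file system thinks in logical addresses; the mapping yields both copies.
copies = resolve(bg, 5 * GIB + 4096)
print(copies)
```

The point of the sketch is that everything above the mapping deals with
a single logical address, and the two copies only appear once that
address is resolved.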

Because the "pool" is 9G, and each 1G of data results in two 1G
"mirror" chunks, each written to a different drive, writes consume
double the space. Two copies for raid1. The 'btrfs filesystem usage'
command reveals this reality, whereas 'df' kinda lies to make it
behave more like what we've come to expect from more conventional
raid1 implementations. This lie works ok for an even number of
same-size devices. It starts to fall apart [1] with an odd number of
drives, or with devices of differing sizes. So you're likely to run up
against some remaining issues in 'df' reporting in this example.

https://carfax.org.uk/btrfs-usage/

Set three disks. On the right side, use preset raid1. Go down to
Devices sizes and enter 4000,3000,2000. And it'll show you what
happens.
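
If you'd rather see the arithmetic than the web page, here is a rough
Python sketch of the usual rule of thumb for two-copy raid1: every
block group needs space on two different devices, so usable space is
capped both by half the pool and by whatever the other devices can
pair with the largest one. I'm not claiming this is how the calculator
is implemented; raid1_usable is a made-up name, and the estimate
ignores metadata, system chunks and chunk granularity.

```python
def raid1_usable(sizes):
    """Estimate usable space for two-copy raid1 across devices of the given sizes.

    Assumption: ideal pairing of copies, no metadata overhead,
    no minimum chunk size. Units are whatever the caller uses.
    """
    if len(sizes) < 2:
        raise ValueError("raid1 needs at least two devices")
    total = sum(sizes)
    largest = max(sizes)
    # Each unit of data consumes two units of raw space (total / 2),
    # and the largest device can only mirror against the rest (total - largest).
    return min(total / 2, total - largest)


print(raid1_usable([4000, 3000, 2000]))  # the example from this thread
print(raid1_usable([4000, 1000, 1000]))  # one oversized device limits the pool
```

For 4000,3000,2000 this gives 4500, which is the 4.5G figure from the
question. For 4000,1000,1000 it gives 2000, because half of the big
drive has nothing to mirror against.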

> If I create one file of 3GB I think that
> 3 GB is written on 4GB disk, it leaves 1 GB free.
> 3 GB  of copy is written on 3 GB disk, it leaves 0 GB Free.

It's more complicated than that because first it'll be broken up into
three 1GB block groups (possibly more and smaller block groups), and
then the allocator tries to maintain equal free space. That means
it'll tend to initially write to the biggest and 2nd biggest drives,
but it won't fill either of them up. It'll start writing to the
smallest device once that device has more free space than the middle
device. And yep, it can split up chunks like this, sorta like Tetris.
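
Here is a toy Python simulation of that behavior, assuming a greedy
policy: for each block group, put one chunk on each of the two devices
with the most free space. The real allocator has more going on
(variable chunk sizes, metadata, system chunks), so treat this as a
sketch; simulate_raid1 is a made-up name and ties are broken by device
order.

```python
def simulate_raid1(free, n_block_groups, chunk=1):
    """Greedily place raid1 block groups, one chunk on each of two devices.

    free: free space per device (needs at least two devices);
    chunk: size of one chunk, same units.
    Returns (placements, remaining free space). Stops early when no two
    devices can each hold a chunk.
    """
    free = list(free)
    placements = []
    for _ in range(n_block_groups):
        # Devices ordered by free space, largest first (stable on ties).
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk:
            break  # the second-best device cannot hold a mirror copy
        free[a] -= chunk
        free[b] -= chunk
        placements.append((a, b))
    return placements, free


# Writing a 3G file as three 1G block groups onto 4G, 3G and 2G devices.
placements, remaining = simulate_raid1([4, 3, 2], 3)
print(placements)  # device index pairs, one pair per block group
print(remaining)   # free space left on each device
```

With these inputs the first two block groups land on the 4G and 3G
devices, the third lands on the 4G and 2G devices, and each device ends
up with 1G free. Asking for more block groups with 1G chunks stops
after the fourth, which is why reaching the full 4.5G depends on
smaller chunks.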

The example size 9G is perhaps not a great example of real world
allocation for btrfs raid1, I'd bump that to T :) 9G is even below the
threshold of USB sticks you can buy off the shelf these days.

>
> So, I create one file of 1GB that is written on 4GB disk, it leaves 0 GB free.
> 1 GB of copy is written on 2 GB disk, so it leaves 1 GB free.
>
> So I've used 4GB, ok it leaves 1 GB free on only one disk, but cannot be mirrored.
>
> However as [1] I could use 4.5 ((4GB+3GB+2GB)/2) GB instead of 4GB. Surely, I'm missing or mistaking something.

Block groups and chunks. There's lots of jargon in btrfs that sounds
familiar but doesn't mean the same thing as in mdadm or LVM; they're
just reused terms. Another example: raid1 or raid10 on btrfs don't work
like you're used to with mdadm and LVM. i.e. raid10 on btrfs is not a
"stripe of mirrored drives", it is "striped and mirrored block
groups". man mkfs.btrfs has quite concise and important information
about such things, and of course questions are welcome.

So it's worth knowing a bit about how it works differently so you can
properly assess (a) if it fits for your use case and meets your
expectations (b) how to maintain and manage it, in particular disaster
recovery. Because that too is different.

[1]
https://github.com/kdave/btrfs-progs/issues/277

-- 
Chris Murphy