On Tue, Dec 8, 2020 at 5:08 PM Kevin Kofler via devel
<devel@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Sergio Belkin wrote:
> > So, let's say we have 3 small disks: 4 GB, 3 GB, and 2 GB.
> >
> > If I create one 3 GB file, I think that:
> > 3 GB is written on the 4 GB disk, leaving 1 GB free;
> > 3 GB of the copy is written on the 3 GB disk, leaving 0 GB free.
> >
> > Then I create one 1 GB file that is written on the 4 GB disk, leaving
> > 0 GB free;
> > 1 GB of the copy is written on the 2 GB disk, leaving 1 GB free.
> >
> > So I've used 4 GB. OK, that leaves 1 GB free, but only on one disk, so
> > it cannot be mirrored.
> >
> > However, as [1] suggests, I should be able to use 4.5 GB
> > ((4 GB + 3 GB + 2 GB) / 2) instead of 4 GB. Surely I'm missing or
> > mistaking something.
> >
> > Please, could you help me?
>
> The optimum size can theoretically be achieved by using the following
> physical partitioning:
> * x GB on the 4 GB disk and the 3 GB disk,
> * y GB on the 4 GB disk and the 2 GB disk, and
> * z GB on the 3 GB disk and the 2 GB disk,
> for a total of x+y+z GB, where x, y, and z solve the following system of
> equations:
> * x+y=4
> * x+z=3
> * y+z=2
> i.e., in standard form:
> * 1x+1y+0z=4
> * 1x+0y+1z=3
> * 0x+1y+1z=2
> The determinant of this system is -2, which is not 0, so the system
> admits a unique solution. It can be computed using any method for
> solving linear systems of equations, such as direct substitution
> (solving an equation for one variable and substituting it), Gaussian
> elimination with back substitution, Gauss-Jordan (bidirectional)
> elimination, or Cramer's rule. The result is:
> * x=2.5
> * y=1.5
> * z=0.5
> for a total of x+y+z=2.5+1.5+0.5=4.5 GB.
>
> Now how btrfs actually handles this in practice is a different story.
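(Kevin's solution can be checked mechanically. A minimal Python sketch, added here for illustration only, solving the same 3x3 system with Cramer's rule; the helper names `det3` and `replace_col` are made up for this sketch:)

```python
# Solve the 3x3 system from Kevin's mail with Cramer's rule:
#   x + y     = 4
#   x     + z = 3
#       y + z = 2

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    a, b, c = m[0]
    d, e, f = m[1]
    g, h, i = m[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def replace_col(m, col, vec):
    """Copy of m with column `col` replaced by vec (for Cramer's rule)."""
    return [[vec[r] if c == col else m[r][c] for c in range(3)]
            for r in range(3)]

A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
b = [4, 3, 2]

D = det3(A)  # -2: non-zero, so a unique solution exists
x, y, z = (det3(replace_col(A, j, b)) / D for j in range(3))
print(x, y, z, x + y + z)  # 2.5 1.5 0.5 4.5
```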
> Judging from Chris Murphy's reply, it does not precompute the above
> repartition, but tries to dynamically select 2 disks for each newly
> allocated 1 GB block to approximate the optimal solution for large
> enough drives. (It will not achieve the optimum for the sizes in your
> example, because the optimal allocation is not an integer number of
> gigabytes, and will in fact be pretty far from the optimum due to the
> small sizes; the larger the disks, the less noticeable the loss.)

It's a bit more complicated still. The block group size is typically
1 GiB, but in reality it's variable, depending on the file system size
and the unallocated space remaining. I don't know the minimum size,
although I have seen 128 MiB data block groups.

The reason block groups are not set in advance is that there are
different types of block groups: data and metadata. File system blocks go
in metadata block groups, and blocks for file data go in data block
groups. The ratio of data to metadata usage is workload dependent: some
workloads produce heavy metadata, others less so.

Why separate block groups? They can have different block sizes and
redundancy profiles, e.g. by default a 16 KiB block size for metadata and
4 KiB for data. And by default, single hard drives get dup metadata and
single data, while file systems on 2+ devices get raid1 metadata and
single data. It's done this way for efficiency and features. I'll stop
here before I fall into a balance, resize, multiple-device rabbit hole.

(dup = two copies on a single device; it can also apply to data)

--
Chris Murphy
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx
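(As a postscript to the thread: the "dynamically select 2 disks per block group" behaviour described above can be simulated with a toy greedy allocator. This is an illustrative sketch only, not btrfs code; `raid1_usable` is a hypothetical helper, and it ignores metadata block groups and btrfs's variable block group sizes:)

```python
# Toy simulation (NOT the real btrfs allocator): greedily allocate raid1
# block groups by always picking the two devices with the most
# unallocated space, at a fixed block group granularity `bg` (in GiB).

def raid1_usable(free, bg=1.0):
    """Approximate usable raid1 capacity for the given per-device free space."""
    free = list(free)  # don't mutate the caller's list
    usable = 0.0
    while True:
        # Pick the two devices with the most unallocated space.
        a, b = sorted(range(len(free)), key=lambda i: free[i],
                      reverse=True)[:2]
        if free[a] < bg or free[b] < bg:
            break  # no two devices can hold another full block group
        free[a] -= bg
        free[b] -= bg
        usable += bg  # one mirrored block group stores bg of data
    return usable

print(raid1_usable([4.0, 3.0, 2.0]))          # 4.0 with 1 GiB block groups
print(raid1_usable([4.0, 3.0, 2.0], bg=0.5))  # 4.5 with 0.5 GiB granularity
```

With 1 GiB granularity the toy allocator reproduces Sergio's observed 4 GB; with finer granularity it reaches the 4.5 GB theoretical optimum Kevin derived, matching the point that the loss comes from the block group size relative to the disk sizes.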