Re: What is the best way to combine 4 HDD's and 2 SSD's into a single filesystem?

On Sun, Jan 17, 2021 at 1:43 PM <Cedric.dewijs@xxxxxxxxxx> wrote:
>
> I have 4 slow, loud, big, power-hungry old hard drives, and 2 SSDs. I'm trying to come up with a way to combine them into a system that has the following characteristics:
>
> A) The hard drives stop spinning 5 minutes after they have been used (see the hdparm sketch after this list).
> B) The SSDs are used for read and write caching. Writes to the system are absorbed by the SSDs; only when the SSDs are full of dirty data are the hard drives woken up. (This means the SSDs may contain dirty data for a long time.)
> C) When data is requested that's not present on the SSDs (a read cache miss), the hard drive that holds that data is woken up.
> D) When a hard drive is woken up as a result of a read cache miss, the SSDs write out their dirty data to that drive.
> E) If one drive fails, or starts to produce random data, the system must still return the correct data to the user.
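>
> Characteristic A is independent of the cache layers; a minimal sketch using hdparm (device name hypothetical; -S counts in units of 5 seconds, so 60 means 5 minutes):
>
>     # spin the drive down after 5 minutes of inactivity
>     hdparm -S 60 /dev/sdX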
>
> First idea is to use this stack of bcache and btrfs:
> +--------------------------------------------+--------------+
> |               btrfs raid 1 (2 copies) /mnt                |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> |                       Cache (SSD)                         |
> |                       /dev/sda4                           |
> +--------------+--------------+--------------+--------------+
> | Data HDD     | Data HDD     | Data HDD     | Data HDD     |
> | /dev/sda8    | /dev/sda9    | /dev/sda10   | /dev/sda11   |
> +--------------+--------------+--------------+--------------+
> The good:
> Btrfs in raid 1 is able to handle a failing hard drive, both when it fails completely and when it silently corrupts data.
> Bcache is capable of using an SSD to cache the read and write requests from btrfs.
> The not-so-good:
> Bcache can only use one SSD, so to preserve characteristic E it can only be used as a read (writethrough) cache: in writeback mode the single SSD would hold the only copy of the dirty data for both btrfs mirrors. This prevents characteristic B from being achieved.
> I can't get bcache to read ahead the data adjacent to the data that has just been accessed.
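>
> For reference, a minimal sketch of how this stack would be assembled (partition names as in the diagram; <cset-uuid> is a placeholder to fill in):
>
>     # one cache set on the SSD, four backing devices
>     make-bcache -C /dev/sda4
>     for p in sda8 sda9 sda10 sda11; do make-bcache -B /dev/$p; done
>     # attach each backing device to the cache set; <cset-uuid> comes
>     # from bcache-super-show /dev/sda4
>     for d in bcache0 bcache1 bcache2 bcache3; do
>         echo <cset-uuid> > /sys/block/$d/bcache/attach
>         echo writeback > /sys/block/$d/bcache/cache_mode
>     done
>     # btrfs raid1 (metadata and data) across the four cached devices
>     mkfs.btrfs -m raid1 -d raid1 /dev/bcache0 /dev/bcache1 /dev/bcache2 /dev/bcache3
>
> (Writeback is shown because characteristic B needs it, even though, as noted, a single-SSD writeback cache breaks characteristic E.)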
>
> Second idea is to use an SSD in front of each hard drive:
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> | Cache SSD    |  Cache SSD   |  Cache SSD   |   Cache SSD  |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup achieves all the characteristics I'm after.
> The not-so-good:
> This requires more SSDs and more (SATA) ports than I have.
> I can't get bcache to read ahead the data adjacent to the data that has just been accessed.
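>
> The assembly would differ from the first sketch only in that each backing device gets its own cache set; if I read the bcache docs correctly, creating the cache and backing device in one make-bcache invocation attaches them automatically:
>
>     # pair /dev/sda5..8 (SSD partitions) with /dev/sda9..12 (HDDs)
>     for i in 0 1 2 3; do
>         make-bcache -C /dev/sda$((5+i)) -B /dev/sda$((9+i))
>     done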
>
> Third idea is to use mdadm to create a raid 1 mirror out of the 2 SSDs, as a fault-tolerant write cache:
> +-----------------------------------------------------------+
> |                 btrfs raid 1 (2 copies) /mnt              |
> +--------------+--------------+--------------+--------------+
> | /dev/bcache0 | /dev/bcache1 | /dev/bcache2 | /dev/bcache3 |
> +--------------+--------------+--------------+--------------+
> |                      bcache Cache                         |
> |                         /dev/md0                          |
> +-----------------------------------------------------------+
> |               mdadm raid 1 array /dev/md0                 |
> |             SSD /dev/sda4 and SSD /dev/sda5               |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup is capable of achieving all the characteristics I'm after. It can handle the abrupt failure of a single drive.
> The not-so-good:
> When one of the SSDs starts to produce random data, mdadm has no way to tell which SSD holds the correct data, and data is lost. (Both copies of the data btrfs writes to the underlying storage pass through the same 2 SSDs.)
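>
> A minimal sketch of the cache layer for this idea:
>
>     # mirror the two SSDs so the cache survives one SSD failing outright
>     mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda4 /dev/sda5
>     # then use the mirror as the single bcache cache set, as in the first idea
>     make-bcache -C /dev/md0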
>
> Fourth idea is to use dm-cache. Dm-cache can only cache one backing device per cache, and it has no way to use 2 cache devices, so each HDD needs its own cache SSD:
> +-----------------------------------------------------------+
> |                btrfs raid 1 (2 copies) /mnt               |
> +--------------+--------------+--------------+--------------+
> | /dev/dm-0    | /dev/dm-1    | /dev/dm-2    | /dev/dm-3    |
> +--------------+--------------+--------------+--------------+
> | Cache SSD    |  Cache SSD   |  Cache SSD   |   Cache SSD  |
> | /dev/sda5    | /dev/sda6    | /dev/sda7    | /dev/sda8    |
> +--------------+--------------+--------------+--------------+
> | Data         | Data         | Data         | Data         |
> | /dev/sda9    | /dev/sda10   | /dev/sda11   | /dev/sda12   |
> +--------------+--------------+--------------+--------------+
> The good:
> This setup is capable of achieving all the characteristics I'm after.
> The not-so-good:
> This requires more SSDs and more (SATA) ports than I have.
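>
> For completeness, a sketch of one HDD+SSD pair set up via lvmcache, the usual front end to dm-cache (volume group and LV names hypothetical); it would have to be repeated per pair, hence the port problem:
>
>     vgcreate vg0 /dev/sda9 /dev/sda5
>     lvcreate -n data  -l 100%PVS vg0 /dev/sda9   # origin LV on the HDD
>     lvcreate -n cache -l 100%PVS vg0 /dev/sda5   # cache LV on the SSD
>     lvconvert --type cache --cachevol cache vg0/data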
>
> What options do I have to create the desired setup?
> Is it feasible to add a checksum to mdadm, much like btrfs has, so it can tell which drive (if any) has returned the correct data?
>
> Is this the correct mailing list to ask these questions?
>


I think all the options are complicated, and lack sufficient hardware
isolation to withstand SSD failure. They are also complicated in a
recovery/restore context. I think you're better off with two separate
storage setups; this isolates problems and makes recovery much simpler
and more robust.

A

HDD1+SSD1 =>
HDD2+SSD2 => Btrfs raid1 profile for metadata; raid0 or raid1
             profile for data, depending on your risk tolerance

B

HDD3+HDD4 => Btrfs raid1 profile for metadata and data.


Set up btrbk to take snapshots and use btrfs send/receive to cheaply
replicate A to B. Configure a scheduled scrub of A and B once a month.
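
A minimal btrbk.conf sketch for that (mount points and subvolume
names hypothetical):

    volume /mnt/a
      snapshot_preserve_min 2d
      subvolume important
        target send-receive /mnt/b/backups

And the monthly scrub, from cron or a systemd timer:

    btrfs scrub start /mnt/a
    btrfs scrub start /mnt/b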

Assuming A has more storage capacity than B, you can have important vs
throwaway subvolumes: important ones are backed up, throwaway ones aren't.

If A has problems and can't be repaired, just blow it away and
recreate it from scratch, then restore from the latest snapshot on B.

Still another variation would be to skip the cache, and set up a
separate file system on the SSDs.

-- 
Chris Murphy



