On Thu, Jun 23, 2022 at 11:05 PM ToddAndMargo via users
<users@xxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi All,
>
> Any of you guys know of a PCIe card that will do
> hardware RAID 1 with two NVMe drives?
>
> I have found some, but they are way to elaborate,
> and as such, way too expensive.

I'm really not certain how sophisticated or reliable either PCIe or NVMe is with respect to error reporting, or even whether it varies by make and model. My understanding is that internally it has to be good, because your data isn't really stored in any recognizable form on solid state drives; it's a "probabilistic representation of your data" and requires really sophisticated encoding/decoding to "almost certainly" return your data. But when that doesn't happen, it (anecdotally) seems curiously rare to get discrete read errors like we see with hard drives. Instead, it's common for the drive to return garbage or zeros in place of your data.

This is where Btrfs shines in general, but it really shines in the raid1 configuration. In the normal single-drive configuration, Btrfs will verbosely complain. It has a limited ability to correct when the metadata profile is dup (two copies of the metadata on one drive), which has been the mkfs default since btrfs-progs ~5.15. For various reasons, even dup might end up with two bad copies on a single SSD.

But in the raid1 configuration (two copies on different devices), Btrfs can unambiguously determine on every read whether data or metadata is wrong, grab the good copy from the other drive, and overwrite the bad copy. And this is all automatic. You'll still see the same scary verbose messages in dmesg, but you'll also see additional messages for the fixups. Fixups also happen during scrub, which is useful for areas that aren't regularly read (rough command sketches are at the end of this message).

Conversely, any hardware, mdadm, or LVM RAID depends on the hardware reporting a read error. If garbage or zeros are returned, the RAID can't do anything about it. [1]

Sounds great. So why not Btrfs raid1? Well, right now the code that handles degraded mdadm RAID is all in dracut (in the initramfs). The initramfs contains dracut scripts that try to assemble the RAID; if a drive is missing, the array won't assemble, so the scripts start a loop, wait about 3 minutes, and then attempt a degraded assembly. But dracut doesn't handle Btrfs in the same situation, and no one has done the work so far to make it possible. If a drive flat out dies, what happens at boot time is an indefinite wait for the device to appear, because of a udev rule that requires all Btrfs member devices to be present before the mount is attempted. That's good in the sense that we don't want to prematurely attempt a normal or degraded mount. Anyway, this area needs development work. So if your use case requires unattended boot when a drive has failed, this setup is not for you.

So those are the current trade-offs.

[1] There's experimental dm-integrity support via cryptsetup. It works rather differently than Btrfs, but it has the ability to detect such corruption and report it to the upper layer as a read error, where the normal RAID error correction can then work properly.
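For anyone who wants to try the raid1 configuration described above, a rough sketch follows. The device names (/dev/nvme0n1, /dev/nvme1n1) and the /mnt mount point are placeholders; adjust for your system.

  # two-device Btrfs with both data and metadata mirrored (raid1)
  mkfs.btrfs -m raid1 -d raid1 /dev/nvme0n1 /dev/nvme1n1

  # mounting either member device mounts the whole file system
  mount /dev/nvme0n1 /mnt

  # periodic scrub: verifies checksums and fixes up bad copies
  # from the good mirror
  btrfs scrub start /mnt
  btrfs scrub status /mnt

  # per-device counters for read/write/corruption errors
  btrfs device stats /mnt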
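And if a drive does flat out die, a degraded mount has to be requested explicitly; Btrfs won't do it on its own. Again a sketch with placeholder names, assuming /dev/nvme1n1 is the dead member and /dev/nvme2n1 is its replacement. For a Btrfs root file system the rough equivalent is adding rootflags=degraded to the kernel command line, though as noted above the missing-device wait at boot still isn't handled gracefully.

  # mount the surviving member with the degraded option
  mount -o degraded /dev/nvme0n1 /mnt

  # replace the missing device; devid 2 is an assumption here,
  # check 'btrfs filesystem show' for the actual missing devid
  btrfs replace start 2 /dev/nvme2n1 /mnt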
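Footnote [1] in practice: the standalone dm-integrity tool shipped with cryptsetup is integritysetup. The idea is to put a checksumming layer under each RAID member, then build the mdadm array on top of the /dev/mapper devices, so detected corruption surfaces as a read error the RAID can repair. Device and mapping names below are placeholders, and this is still experimental, so treat it as a sketch rather than a recommendation.

  # add a per-sector checksum layer to each member (wipes the device)
  integritysetup format /dev/sda1
  integritysetup format /dev/sdb1
  integritysetup open /dev/sda1 int-a
  integritysetup open /dev/sdb1 int-b

  # build the mirror on top of the integrity mappings
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/mapper/int-a /dev/mapper/int-b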
-- 
Chris Murphy