Re: Filesystem-based raid vs. device-based raid

On 20/09/18 20:52, David F wrote:
> I can't imagine that this isn't a frequently asked question, but with my
> poor search skills, I've come up completely empty on this.
> 
> I'm trying to understand why the newer "sophisticated" filesystems (e.g.
> btrfs) are implementing raid redundancy as part of the filesystem rather
> than the traditional approach of a separate virtual-block-device layer
> such as md, with a filesystem on top of it as a distinct layer.  In
> addition to replication of effort/code [again and again for each new
> filesystem implementation that comes along], it seems to be mixing too
> much functionality into one monolithic layer, increasing complexity and
> the subsequent inevitable increased number of bugs and difficulty of
> debugging.

As others have said, it's a simple trade-off. Separating filesystem and
raid into two separate layers makes both layers *much* simpler.
Combining the two makes intelligent recovery *much* easier. You pays
your money, and you makes your choice.
> 
> Of course, the people working on these filesystems aren't idiots, so I
> assume that there _are_ reasons, but in speculating what they are, I
> don't come up with much that seems to me to overcome the inherent
> disadvantages of the integrated approach.  The primary thing that I've
> thought of is the ability to use filesystem-specific information to
> optimize raid operations, such as awareness that a given block is not
> currently in use by any file, so does not need to be redundantly stored
> (e.g. during a resync operation). Yet surely an API could be created to
> allow cross-layer communication of this sort of information, similar to
> the TRIM command of SSDs.

There's a difference between intelligence (no, these people aren't
idiots) and experience - it seems the raid-5/6 code in btrfs is fubar
beyond repair, and the fix is to throw it out and start over again.
Intelligent people cock up. What was it I said about combining the two
adding complexity?

I've been thinking about adding TRIM functionality to md-raid, and in
theory it's dead easy. Just track all writes, keep a bitmap per sector
or track or whatever to record what's in use, and Bob's your uncle. It
just chews up a load more disk space to record that bitmap, adds more
complexity to the code, and needs someone to code it up. And as always,
that last requirement is probably the hardest one to fill. This is
exactly the sort of hardening / maintenance coding I would love to do,
but I just can't find the time/hardware to get stuck in.
> 
> Can anyone shed light on the relative advantages and disadvantages of
> integrating raid redundancy functionality directly into a filesystem?
> 
> Thank you very much in advance,
> David
Just draw up a list of pros and cons. You'll find plenty on either side
if you think about it. Then as I say, you pays your money, you makes
your choice.

NB - If you do get a decent number of ideas and can write it up, I'd
love to have it to add to the wiki :-)

Cheers,
Wol



