Re: ZFS or BTRFS for performance?

Hello,

On Sun, 20 Mar 2016 00:45:47 +0100 Lionel Bouton wrote:

> On 19/03/2016 18:38, Heath Albritton wrote:
> > If you google "ceph bluestore" you'll be able to find a couple of
> > slide decks on the topic.  One of them, by Sage, is easy to follow
> > without the benefit of the presentation.  There's also the "Redhat
> > Ceph Storage Roadmap 2016" deck.
> >
> > In any case, bluestore is not intended to address bitrot.  Given that
> > Ceph is a distributed file system, many of the POSIX file system
> > features are not required for the underlying block storage device.
> > Bluestore is intended to address this and reduce the disk IO required
> > to store user data.
> >
> > Ceph protects against bitrot at a much higher level by validating the
> > checksum of the entire placement group during a deep scrub.
> 
That's not protection, that's an "uh-oh, something is wrong, you better
check it out" notification, after which you get to spend a lot of time
figuring out which is the good replica. As Lionel wrote, with only 2
replicas and binary data you might as well roll a die (see the sketch
below).

Completely unacceptable and my oldest pet peeve about Ceph.
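
To make the problem concrete, here is a minimal Python sketch (nothing
Ceph-specific; the object contents are made up) of why two disagreeing
replicas are a coin flip while three or more allow a majority vote:

import hashlib
from collections import Counter

def pick_good_replica(replicas):
    """Return the majority copy, or None if there is no strict
    majority (e.g. 2 replicas that disagree)."""
    digests = [hashlib.sha256(data).hexdigest() for data in replicas]
    (digest, count), = Counter(digests).most_common(1)
    if count <= len(replicas) // 2:
        return None  # no strict majority: can't tell which copy is good
    return replicas[digests.index(digest)]

# Two replicas, one silently corrupted: undecidable.
print(pick_good_replica([b"good data", b"gXod data"]))  # None
# Three replicas: the corrupted copy is simply outvoted.
print(pick_good_replica([b"good data", b"gXod data", b"good data"]))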

I'd be deeply disappointed if bluestore went ahead and ignored that
elephant in the room as well.

> My impression is that the only protection against bitrot is provided by
> the underlying filesystem which means that you don't get any if you use
> XFS or EXT4.
> 
Indeed.

> I can't trust Ceph on this alone until its bitrot protection (if any) is
> clearly documented. The situation is far from clear right now. The
> documentation states that deep scrubs are using checksums to validate
> data, but this is not good enough, at least because we don't know what
> these checksums are supposed to cover (see below for another reason).
> There is even this howto by Sebastien Han about repairing a PG:
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> which clearly concludes that with only 2 replicas you can't reliably
> find out which object is corrupted with Ceph alone. If Ceph really
> stored checksums to verify all the objects it stores we could manually
> check which replica is valid.
> 
AFAIK it uses checksums created on the fly to compare the data during
deep scrubs.
I also recall talk of storing permanent checksums, but I have no idea
what the status of that is.
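
A rough sketch of the difference, in Python (the stored-checksum scheme
here is hypothetical, not Ceph's actual implementation): on-the-fly
digests can only detect that replicas disagree, while a digest persisted
at write time identifies the valid copy directly, even with just 2
replicas:

import hashlib

def scrub_compare_only(replicas):
    """On-the-fly scrub: hash every copy now and compare the digests.
    Detects disagreement, but not which copy is the valid one."""
    digests = {hashlib.sha256(data).hexdigest() for data in replicas}
    return len(digests) == 1  # True = consistent

def scrub_against_stored(replicas, stored_digest):
    """Scrub against a digest persisted at write time: every copy can
    be judged good or bad on its own."""
    return [hashlib.sha256(data).hexdigest() == stored_digest
            for data in replicas]

stored = hashlib.sha256(b"good data").hexdigest()
print(scrub_compare_only([b"good data", b"gXod data"]))   # False, but which is bad?
print(scrub_against_stored([b"good data", b"gXod data"], stored))  # [True, False]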

> Even if deep scrubs used checksums to verify data this would not be
> enough to protect against bitrot: there is a window between a corruption
> event and a deep scrub during which the data on a primary can be returned
> to a client. BTRFS solves this problem by returning an IO error for any
> data read that doesn't match its checksum (or automatically rebuilds it
> if the allocation group is using RAID1/10/5/6). I've never seen this kind
> of behavior documented for Ceph.
> 
Ditto.

And if/when Ceph gets reliable checksumming (in the storage layer), it
should definitely get auto-repair abilities as well.
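
Something along these lines, as a minimal sketch (the repair path is
hypothetical, not an existing Ceph interface): checksum on write, verify
every read, and rewrite a corrupt copy from a valid replica instead of
waiting for the next deep scrub:

import hashlib

class ChecksummedStore:
    """Toy object store that checksums on write and verifies on read,
    BTRFS-style: a read never silently returns corrupted data."""

    def __init__(self):
        self.objects = {}    # name -> data
        self.checksums = {}  # name -> sha256 hex digest

    def write(self, name, data):
        self.objects[name] = data
        self.checksums[name] = hashlib.sha256(data).hexdigest()

    def read(self, name, replicas=()):
        data = self.objects[name]
        if hashlib.sha256(data).hexdigest() == self.checksums[name]:
            return data
        # Local copy is corrupt: auto-repair from the first valid replica.
        for candidate in replicas:
            if hashlib.sha256(candidate).hexdigest() == self.checksums[name]:
                self.write(name, candidate)  # repair in place
                return candidate
        raise IOError("checksum mismatch on %r, no valid replica" % name)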


Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



