Re: ZFS or BTRFS for performance?

On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple of slide decks on the topic. One of them, by Sage, is easy to follow without the benefit of the presentation. There's also the "Redhat Ceph Storage Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot. Given that Ceph is a distributed file system, many of the POSIX file system features are not required for the underlying block storage device. Bluestore is intended to address this and reduce the disk IO required to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by the underlying filesystem, which means that you don't get any if you use XFS or EXT4.

I can't trust Ceph on this alone until its bitrot protection (if any) is clearly documented. The situation is far from clear right now. The documentation states that deep scrubs use checksums to validate data, but this is not good enough, at least because we don't know what these checksums are supposed to cover (see below for another reason). There is even this howto by Sebastien Han about repairing a PG:
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
which clearly concludes that with only 2 replicas you can't reliably find out which object is corrupted with Ceph alone. If Ceph really stored checksums to verify all the objects it stores, we could manually check which replica is valid.
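
For illustration, here is a rough sketch of the kind of manual comparison that howto describes: hash each replica's on-disk object file and see whether a majority agrees. The paths and object names below are hypothetical (my assumption of a filestore layout, located in practice via "ceph osd map" and a find under each OSD's current/ directory); the point is only that with 2 replicas there is no majority to decide with.

import hashlib
from collections import Counter

# Hypothetical replica locations for one object of an inconsistent PG.
replica_files = {
    "osd.0": "/var/lib/ceph/osd/ceph-0/current/0.6_head/object1__head_F35569D0__0",
    "osd.1": "/var/lib/ceph/osd/ceph-1/current/0.6_head/object1__head_F35569D0__0",
    "osd.2": "/var/lib/ceph/osd/ceph-2/current/0.6_head/object1__head_F35569D0__0",
}

def digest(path):
    # md5 of one on-disk replica, read in 1 MB chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

digests = {osd: digest(path) for osd, path in replica_files.items()}
majority, votes = Counter(digests.values()).most_common(1)[0]

if votes == len(digests):
    print("all replicas match")
elif votes >= 2:
    # With 3 or more replicas a majority identifies the bad copy.
    for osd, d in sorted(digests.items()):
        print("%s: %s" % (osd, "ok" if d == majority else "SUSPECT"))
else:
    # With only 2 replicas the digests simply disagree and nothing tells us
    # which copy is the valid one -- the howto's conclusion.
    print("replicas disagree, no majority: cannot tell which copy is valid")

With end-to-end object checksums stored by Ceph itself, the last branch would not be needed: each copy could be checked against the stored checksum directly.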

Even if deep scrubs used checksums to verify data, this would not be enough to protect against bitrot: there is a window between a corruption event and the next deep scrub during which corrupted data on a primary can be returned to a client. BTRFS solves this problem by returning an IO error for any data read that doesn't match its checksum (or by automatically rebuilding the data if the block group uses a RAID1/10/5/6 profile). I've never seen this kind of behavior documented for Ceph.
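
To make that point concrete, here is a minimal sketch (hypothetical path, assuming the object file sits on a BTRFS-backed OSD) of what the behaviour looks like from an application's point of view: the read fails with EIO instead of silently handing back the rotted bytes.

import errno

path = "/var/lib/ceph/osd/ceph-0/current/0.6_head/object1__head_F35569D0__0"

try:
    with open(path, "rb") as f:
        data = f.read()
    print("read %d bytes, filesystem checksums were valid" % len(data))
except IOError as e:
    if e.errno == errno.EIO:
        # BTRFS detected a checksum mismatch (and could not rebuild the data
        # from another copy), so the corruption is reported rather than served.
        print("EIO: checksum mismatch reported by the filesystem")
    else:
        raise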

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
