Re: ZFS or BTRFS for performance?

On 19/03/2016 18:38, Heath Albritton wrote:
> If you google "ceph bluestore" you'll be able to find a couple of slide decks on the topic. One of them, by Sage, is easy to follow without the benefit of the presentation. There's also the "Redhat Ceph Storage Roadmap 2016" deck.
>
> In any case, bluestore is not intended to address bitrot. Given that Ceph is a distributed file system, many of the POSIX file system features are not required for the underlying block storage device. Bluestore is intended to address this and reduce the disk IO required to store user data.
>
> Ceph protects against bitrot at a much higher level by validating the checksum of the entire placement group during a deep scrub.

My impression is that the only protection against bitrot is provided by the underlying filesystem, which means that you don't get any if you use XFS or EXT4.

I can't trust Ceph on this alone until its bitrot protection (if any) is clearly documented. The situation is far from clear right now. The documentation states that deep scrubs use checksums to validate data, but this is not good enough, at least because we don't know what these checksums are supposed to cover (see below for another reason). There is even this howto by Sebastien Han about repairing a PG:
http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
which clearly concludes that with only 2 replicas you can't reliably find out which object is corrupted with Ceph alone. If Ceph really stored checksums to verify all the objects it stores, we could manually check which replica is valid.
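
For illustration, here is a rough sketch of the kind of manual comparison that howto describes: hash each replica's on-disk object file and see whether a majority agrees. The paths and object names below are hypothetical (my assumption of a filestore layout, located in practice via "ceph osd map" and a find under each OSD's current/ directory); the point is only that with 2 replicas there is no majority to decide with.

import hashlib
from collections import Counter

# Hypothetical replica locations for one object of an inconsistent PG.
replica_files = {
    "osd.0": "/var/lib/ceph/osd/ceph-0/current/0.6_head/object1__head_F35569D0__0",
    "osd.1": "/var/lib/ceph/osd/ceph-1/current/0.6_head/object1__head_F35569D0__0",
    "osd.2": "/var/lib/ceph/osd/ceph-2/current/0.6_head/object1__head_F35569D0__0",
}

def digest(path):
    # md5 of one on-disk replica, read in 1 MB chunks.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

digests = {osd: digest(path) for osd, path in replica_files.items()}
majority, votes = Counter(digests.values()).most_common(1)[0]

if votes == len(digests):
    print("all replicas match")
elif votes >= 2:
    # With 3 or more replicas a majority identifies the bad copy.
    for osd, d in sorted(digests.items()):
        print("%s: %s" % (osd, "ok" if d == majority else "SUSPECT"))
else:
    # With only 2 replicas the digests simply disagree and nothing tells us
    # which copy is the valid one -- the howto's conclusion.
    print("replicas disagree, no majority: cannot tell which copy is valid")

With end-to-end object checksums stored by Ceph itself, the last branch would not be needed: each copy could be checked against the stored checksum directly.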

Even if deep scrubs used checksums to verify data, this would not be enough to protect against bitrot: there is a window between a corruption event and the next deep scrub during which corrupted data on a primary can be returned to a client. BTRFS solves this problem by returning an IO error for any data read that doesn't match its checksum (or by automatically rebuilding the data if the block group uses a RAID1/10/5/6 profile). I've never seen this kind of behavior documented for Ceph.
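
To make that point concrete, here is a minimal sketch (hypothetical path, assuming the object file sits on a BTRFS-backed OSD) of what the behaviour looks like from an application's point of view: the read fails with EIO instead of silently handing back the rotted bytes.

import errno

path = "/var/lib/ceph/osd/ceph-0/current/0.6_head/object1__head_F35569D0__0"

try:
    with open(path, "rb") as f:
        data = f.read()
    print("read %d bytes, filesystem checksums were valid" % len(data))
except IOError as e:
    if e.errno == errno.EIO:
        # BTRFS detected a checksum mismatch (and could not rebuild the data
        # from another copy), so the corruption is reported rather than served.
        print("EIO: checksum mismatch reported by the filesystem")
    else:
        raise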

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
