Hello,

On Sun, 20 Mar 2016 00:45:47 +0100 Lionel Bouton wrote:

> On 19/03/2016 18:38, Heath Albritton wrote:
> > If you google "ceph bluestore" you'll be able to find a couple of
> > slide decks on the topic. One of them, by Sage, is easy to follow
> > without the benefit of the presentation. There's also the "Redhat
> > Ceph Storage Roadmap 2016" deck.
> >
> > In any case, bluestore is not intended to address bitrot. Given that
> > Ceph is a distributed file system, many of the POSIX file system
> > features are not required for the underlying block storage device.
> > Bluestore is intended to address this and reduce the disk IO required
> > to store user data.
> >
> > Ceph protects against bitrot at a much higher level by validating the
> > checksum of the entire placement group during a deep scrub.
>

That's not protection, that's an "uh-oh, something is wrong, you better
check it out" notification, after which you get to spend a lot of time
figuring out which is the good replica; as Lionel wrote, in the case of
just 2 replicas and faced with binary data you might as well roll a die.

Completely unacceptable and my oldest pet peeve about Ceph.
I'd be deeply disappointed if bluestore went ahead and ignored that
elephant in the room as well.

> My impression is that the only protection against bitrot is provided
> by the underlying filesystem, which means that you don't get any if
> you use XFS or EXT4.
>

Indeed.

> I can't trust Ceph on this alone until its bitrot protection (if any)
> is clearly documented. The situation is far from clear right now. The
> documentation states that deep scrubs use checksums to validate data,
> but this is not good enough, at least because we don't know what these
> checksums are supposed to cover (see below for another reason). There
> is even this how-to by Sebastien Han about repairing a PG:
> http://www.sebastien-han.fr/blog/2015/04/27/ceph-manually-repair-object/
> which clearly concludes that with only 2 replicas you can't reliably
> find out which object is corrupted with Ceph alone. If Ceph really
> stored checksums to verify all the objects it stores, we could
> manually check which replica is valid.
>

AFAIK it uses checksums created on the fly to compare the data during
deep scrubs. I also recall talk about having permanent checksums stored,
but I have no idea what the status of that is.

> Even if deep scrubs used checksums to verify data, this would not be
> enough to protect against bitrot: there is a window between a
> corruption event and a deep scrub during which the corrupted data on a
> primary can be returned to a client. BTRFS solves this problem by
> returning an IO error for any data read that doesn't match its
> checksum (or automatically rebuilds it if the allocation group is
> using RAID1/10/5/6). I've never seen this kind of behavior documented
> for Ceph.
>

Ditto.
And if/when Ceph gets reliable checksumming in the storage layer, it
should definitely get auto-repair abilities as well.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
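
To make the "which replica is the good one" problem above concrete, here is
a minimal sketch in plain Python, nothing Ceph-specific: the file names are
hypothetical, and the object copies are assumed to have been pulled off the
OSD filesystems by hand, roughly as in the Sebastien Han how-to linked above.
With three or more copies a simple majority vote on the checksums points at
the likely corrupt replica; with only two disagreeing copies there is nothing
to vote on, which is exactly the complaint in this thread.

#!/usr/bin/env python3
# A minimal sketch, not a Ceph tool: given local copies of the same RADOS
# object pulled off each OSD by hand, checksum them and majority-vote on
# which copy is the odd one out. File names below are hypothetical.
import hashlib
from collections import Counter


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def odd_replicas_out(replica_paths):
    """Return the paths whose digest disagrees with the majority digest.

    With only 2 replicas there can be no majority, so this raises --
    which is the point: you cannot tell which copy is the good one.
    """
    digests = {p: sha256_of(p) for p in replica_paths}
    counts = Counter(digests.values())
    if len(counts) == 1:
        return []  # all replicas agree, nothing to repair
    majority_digest, majority_count = counts.most_common(1)[0]
    if majority_count <= len(replica_paths) / 2:
        raise RuntimeError("no majority among replicas; cannot decide")
    return [p for p, d in digests.items() if d != majority_digest]


if __name__ == "__main__":
    # Hypothetical copies of one object taken from three different OSDs.
    print(odd_replicas_out(["osd0.copy", "osd1.copy", "osd2.copy"]))

If the OSDs stored permanent per-object checksums, each copy could be
verified on its own instead of by vote, which is what the auto-repair wish
above would build on.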