Hello,

On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:

> On 09 Jun 2016 02:09, "Christian Balzer" <chibi@xxxxxxx> wrote:
> > Ceph currently doesn't do any (relevant) checksumming at all, so if a
> > PRIMARY PG suffers from bit-rot this will go undetected until the next
> > deep-scrub.
> >
> > This is one of the longest-standing and gravest issues with Ceph and
> > is supposed to be addressed with bluestore (which currently doesn't
> > have checksum-verified reads either).
>
> So if bit rot happens on the primary PG, is ceph spreading the corrupted
> data across the cluster?

No.
You will want to re-read the Ceph docs and the countless posts here about
how replication within Ceph works:
http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale

A client write goes to the primary OSD/PG and will not be ACK'ed to the
client until it has reached all replica OSDs.
This happens while the data is in-flight (in RAM); it is not read back
from the journal or filestore.
(There is a small sketch of this ACK behaviour at the end of this mail.)

> What would be sent to the replica, the original data or the saved one?
>
> When bit rot happens I'll have 1 corrupted object and 2 good ones.
> How do you manage this between deep scrubs? Which data would be used by
> ceph? I think that bit rot on a huge VM block device could lead to a
> mess like the whole device being corrupted.
> Would a VM affected by bit rot be able to stay up and running?
> And what about bit rot in a qcow2 file?
>
Bit rot is a bit hyped; I haven't seen any on the Ceph clusters I run, nor
on other systems here where I (can) actually check for it.

As to how it would affect things, that very much depends.
If it's something like a busy directory inode that gets corrupted, the
data in question will be in RAM (SLAB) and the next update will correct
things.
If it's a logfile, you're likely to never notice until deep-scrub
eventually detects it (the second sketch at the end of this mail shows
how to trigger one by hand).

This isn't a Ceph-specific question; on all systems that aren't backed by
something like ZFS or BTRFS you're potentially vulnerable to this.

Of course if you're that worried, you could always run BTRFS or ZFS
inside your VM and notice immediately when something goes wrong.
I personally wouldn't though, due to the performance penalties involved
(CoW).

> Let me try to explain: when writing to the primary PG I want to write
> bit "1".
> Due to bit rot, "0" gets saved instead.
> Would ceph read the written bit and spread that across the cluster (so
> it will spread "0"), or spread the in-memory value "1"?
>
> What if the journal fails during a read or a write?

Again, you may want to get a deeper understanding of Ceph.
The journal isn't involved in reads.

> Is Ceph able to recover by removing that journal from the affected OSD
> (and keep running at lower speed), or should I use a RAID1 on the SSDs
> used for the journals?
>
Neither; a journal failure is lethal for the OSD involved, and unless you
have LOTS of money RAID1 SSDs are a waste.
If you use DC-level SSDs with sufficient endurance (TBW), a failing SSD
is a very unlikely event.

Additionally your cluster should (NEEDS to) be designed to handle the
loss of a journal SSD and its associated OSDs, since that is less than a
whole node or a whole rack (whatever your failure domain may be).
The last sketch at the end of this mail has a back-of-the-envelope
example of that.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
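
A minimal write-path sketch, assuming the hammer/jewel-era python-rados
bindings and an existing pool; the pool name "rbd" and object name
"demo-object" are placeholders. The oncomplete callback only fires once
the primary OSD has pushed the in-flight data to every replica and
acknowledged the write, and onsafe only once all of them have committed
it to their journals.

#!/usr/bin/env python
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')       # placeholder pool name

def on_ack(completion):
    # All OSDs in the acting set have the data in memory.
    print("write ACKed by primary and all replicas")

def on_safe(completion):
    # All OSDs have committed the data to their journals.
    print("write committed everywhere")

comp = ioctx.aio_write('demo-object', b'payload', offset=0,
                       oncomplete=on_ack, onsafe=on_safe)
comp.wait_for_safe()                    # block until fully committed

ioctx.close()
cluster.shutdown()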
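
A second sketch, this one just driving the stock ceph CLI from Python to
force a deep-scrub instead of waiting for the periodic one (osd deep
scrub interval, one week by default). Pool "rbd" and object "demo-object"
are placeholders again, and the regex that pulls the PG id out of
"ceph osd map" may need adjusting for your release's output format.

#!/usr/bin/env python
import re
import subprocess

def sh(cmd):
    return subprocess.check_output(cmd, shell=True).decode()

# Find the placement group that stores the object ...
out = sh("ceph osd map rbd demo-object")
pgid = re.search(r"pg \S+ \(([0-9a-f]+\.[0-9a-f]+)\)", out).group(1)

# ... and ask for an immediate deep-scrub of it.
sh("ceph pg deep-scrub %s" % pgid)

# A replica whose checksum or size no longer matches shows up as an
# "inconsistent" PG in the health output once the scrub has finished.
print(sh("ceph health detail"))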
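
And finally a back-of-the-envelope capacity sketch for the failure-domain
point above; all numbers are made up for illustration, adjust them to
your hardware.

#!/usr/bin/env python
# Losing one journal SSD takes down every filestore OSD journaling on it,
# and the rest of the cluster needs enough free space (outside that
# failure domain) to re-replicate their data without hitting the
# near-full ratio (0.85 by default).

osds_per_journal_ssd = 4    # hypothetical: 4 OSD journals per SSD
osd_size_tb = 4.0           # hypothetical: 4 TB spinners
avg_fill_ratio = 0.70       # how full those OSDs are on average

to_recover_tb = osds_per_journal_ssd * osd_size_tb * avg_fill_ratio
print("Data to re-replicate after one journal SSD failure: %.1f TB"
      % to_recover_tb)
print("Keep at least that much spare capacity outside the failure domain,"
      " plus margin to stay below the near-full ratio during backfill.")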