Re: Disk failures

Hello,

On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:

> On 9 Jun 2016 02:09, "Christian Balzer" <chibi@xxxxxxx> wrote:
> > Ceph currently doesn't do any (relevant) checksumming at all, so if a
> > PRIMARY PG suffers from bit-rot this will be undetected until the next
> > deep-scrub.
> >
> > This is one of the longest outstanding and gravest issues with Ceph and is
> > supposed to be addressed with bluestore (which currently doesn't have
> > checksum-verified reads either).
> 
> So if bit rot happens on the primary PG, is Ceph spreading the corrupted data
> across the cluster?
No.

You will want to re-read the Ceph docs and the countless posts here about
how replication within Ceph works:
http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale

A client write goes to the primary OSD/PG and will not be ACK'ed to the
client until it has reached all replica OSDs.
This happens while the data is in flight (in RAM); it is not read back from
the journal or filestore.
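
For example (a rough sketch, the pool name and PG id are just placeholders),
you can see the replica count that governs this and which OSDs a given PG's
writes land on:

    # replica count of a pool, i.e. how many OSDs each write must reach
    ceph osd pool get rbd size

    # up/acting OSD set for one PG (placeholder PG id)
    ceph pg map 0.1a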

> What would be sent to the replica, the original data or the saved one?
> 
> When bit rot happens I'll have 1 corrupted object and 2 good.
> How do you manage this between deep scrubs? Which data would be used by
> Ceph? I think that bit rot on a huge VM block device could lead to a
> mess like the whole device being corrupted.
> Would a VM affected by bit rot be able to stay up and running?
> And what about bit rot on a qcow2 file?
> 
Bitrot is a bit hyped; I haven't seen any on the Ceph clusters I run, nor
on other systems here where I (can) actually check for it.

As to how it would affect things, that very much depends.

If it's something like a busy directory inode that gets corrupted, the data
in question will be in RAM (SLAB) and the next update will correct things.

If it's a logfile, you're likely to never notice until deep-scrub detects
it eventually.
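
When a deep-scrub does find a mismatch, the PG is flagged inconsistent and
you repair it by hand; roughly (the PG id is a placeholder, and
list-inconsistent-obj only exists from Jewel on):

    # find PGs flagged inconsistent by deep-scrub
    ceph health detail | grep inconsistent

    # Jewel and later: show which objects/shards actually differ
    rados list-inconsistent-obj 0.6 --format=json-pretty

    # repair the PG; note that with filestore this tends to take the
    # primary's copy as authoritative, so check which copy is good first
    ceph pg repair 0.6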

This isn't a Ceph-specific problem; on any system that isn't backed
by something like ZFS or BTRFS, you're potentially vulnerable to this.

Of course, if you're that worried, you could always run BTRFS or ZFS inside
your VM and notice immediately when something goes wrong.
I personally wouldn't though, due to the performance penalties involved
(CoW).
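
If you do go that route, it's the scrub inside the guest that actually
surfaces the damage; roughly (mount point and pool name are placeholders):

    # BTRFS: verify checksums of all data and metadata, then check the result
    btrfs scrub start /
    btrfs scrub status /

    # ZFS: same idea per pool
    zpool scrub tank
    zpool status tank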


> Let me try to explain: when writing to the primary PG I want to write bit "1".
> Due to bit rot, "0" gets saved.
> Would Ceph read back the written bit and spread that across the cluster (so it
> will spread "0"), or spread the in-memory value "1"?
> 
> What if the journal fails during a read or a write? 
Again, you may want to get a deeper understanding of Ceph.
The journal isn't involved in reads.

> Is Ceph able to
> recover by removing that journal from the affected OSD (and keep
> running at lower speed), or should I use RAID1 on the SSDs used for the journal?
>
Neither. A journal failure is lethal for the OSDs involved, and unless you
have LOTS of money, RAID1 SSDs are a waste.

If you use DC-level SSDs with sufficient endurance (TBW), a failing SSD is
a very unlikely event.
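
Keeping an eye on the wear-out is easy enough; a rough sketch (the exact
SMART attribute name varies by vendor, this is roughly what DC drives report):

    # look for the media wear-out / lifetime-used attribute
    smartctl -A /dev/sda | grep -i -e wear -e percent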

Additionally, your cluster should (NEEDS to) be designed to handle the
loss of a journal SSD and its associated OSDs, since that is less than a
whole node or a whole rack (whatever your failure domain may be).
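
The failure domain is simply whatever bucket type your CRUSH rule picks its
leaves from; the stock Hammer rule uses hosts, e.g. (decompiled crushmap
excerpt, adjust to taste):

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # "host" is the failure domain; use "rack" if your cluster spans racks
            step chooseleaf firstn 0 type host
            step emit
    }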

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/