Re: Disk failures

This is why I use btrfs mirror sets underneath Ceph; going with 2 replicas instead of 3, plus on-the-fly LZO compression, hopefully more than makes up for the space loss. The Ceph deep scrubs replace any need for btrfs scrubs, but I still get the benefit of self-healing when btrfs finds bit rot.
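
For a rough sense of that trade-off, here is a quick back-of-the-envelope sketch in Python (the raw capacity figure and the resulting LZO break-even ratio are illustrative assumptions, not numbers from a real cluster):

# Rough usable-capacity comparison for the layout described above.
# All figures are illustrative assumptions, not benchmarks.
raw_tb = 100.0                                # raw disk capacity (assumed)

# btrfs RAID1 mirrors halve capacity, Ceph size=2 halves it again.
usable_btrfs_2x = raw_tb * 0.5 * 0.5          # 25 TB

# Plain disks with Ceph size=3.
usable_plain_3x = raw_tb / 3.0                # ~33.3 TB

# Average LZO compression ratio needed for the mirrored layout to break even.
break_even = usable_plain_3x / usable_btrfs_2x    # ~1.33x

print(f"btrfs RAID1 + size=2: {usable_btrfs_2x:.1f} TB usable")
print(f"plain disks + size=3: {usable_plain_3x:.1f} TB usable")
print(f"LZO must average >= {break_even:.2f}x to make up the difference")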

The only errors I've run into are from hard shutdowns and possible memory errors due to working with consumer (non-ECC) hardware and memory. I've been running Ceph on top of btrfs under Gentoo since Firefly.

Bill Sharer


On 06/14/2016 09:27 PM, Christian Balzer wrote:
Hello,

On Tue, 14 Jun 2016 14:26:41 +0200 Jan Schermer wrote:

Hi,
bit rot is not "bit rot" per se - nothing is rotting on the drive
platter.
Never mind that I used the wrong terminology (according to Wikipedia) and that my long experience with "laser rot" probably caused me to choose that term; there are data degradation scenarios that are caused by undetected media failures or by corruption happening in the write path, thus making them quite reproducible.

It occurs during reads (mostly, anyway), and it's random. You can happily read a block and get the correct data, then read it again and get garbage, then get correct data again. This could be caused by a worn-out cell on an SSD, but firmware looks for that and rewrites it if the signal is attenuated too much. On spinners there are no cells to refresh, so rewriting doesn't help either.

You can't really "look for" bit rot, due to the reasons above; strong checksumming/hash verification during reads is the only solution.
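
A toy sketch of what checksum-verified reads buy you (a hedged illustration only, not how Ceph, btrfs or ZFS actually implement it): keep a hash next to each block and fall back to another replica when verification fails.

import hashlib

def store(block: bytes) -> dict:
    # Keep a strong checksum alongside the data.
    return {"data": block, "sha256": hashlib.sha256(block).hexdigest()}

def read_verified(replicas: list) -> bytes:
    # Return the first replica whose data still matches its checksum.
    for replica in replicas:
        if hashlib.sha256(replica["data"]).hexdigest() == replica["sha256"]:
            return replica["data"]
    raise IOError("all replicas failed checksum verification")

good = store(b"important payload")
bad = dict(good, data=b"important pay\x00oad")   # simulated bit rot on one copy
assert read_verified([bad, good]) == b"important payload"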

Which is what I've been saying in the mail below and for years on this ML.

And that makes deep-scrubbing something of quite limited value.

Christian
And trust me, bit rot is a very real thing and very dangerous as well -
do you think companies like Seagate or WD would lie about bit rot if it weren't real? I'd buy a drive with BER 10^999 over one with 10^14,
wouldn't everyone? And it is especially dangerous when something like
Ceph handles much larger blocks of data than the client does. While the
client (or an app) has some knowledge of the data _and_ hopefully throws
an error if it read garbage, Ceph will (if for example snapshots are
used and FIEMAP is off) actually have to read the whole object (say
4MiB) and write it elsewhere, without any knowledge of whether what it read
(and wrote) made any sense to the app. This way corruption might spread
silently into your backups if you don't validate the data somehow (or
dump it from a database for example, where it's likely to get detected).
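
To put the quoted BER figures in perspective, a quick back-of-the-envelope calculation (the 8 TB drive size is an arbitrary assumption for illustration):

# Expected unrecoverable read errors for a given bit error rate.
ber = 1e-14                  # 1 unrecoverable error per 10^14 bits read
drive_bytes = 8e12           # assumed 8 TB drive
bits_read = drive_bytes * 8  # 6.4e13 bits for one full read of the drive

expected_errors = bits_read * ber
print(f"Expected errors per full read: {expected_errors:.2f}")
# ~0.64, i.e. a single full read of the drive already has a sizeable chance
# of hitting an unrecoverable error, which is why verification on read matters.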

Btw just because you think you haven't seen it doesn't mean you haven't
seen it - never seen artefacting in movies? Just a random bug in the
decoder, is it? VoD guys would tell you...

For things like databases this is somewhat less impactful - bit rot
doesn't "flip a bit" but affects larger blocks of data (like one
sector), so databases usually catch this during read and err instead of
returning garbage to the client.

Jan



On 09 Jun 2016, at 09:16, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:

On 09 Jun 2016 at 02:09, "Christian Balzer" <chibi@xxxxxxx> wrote:
Ceph currently doesn't do any (relevant) checksumming at all, so if a
PRIMARY PG suffers from bit-rot this will be undetected until the
next deep-scrub.

This is one of the longest-standing and gravest issues with Ceph and is supposed to be addressed by BlueStore (which currently doesn't have checksum-verified reads either).
So if bit rot happens on the primary PG, Ceph spreads the corrupted data across the cluster?
No.

You will want to re-read the Ceph docs and the countless posts here about how replication within Ceph works.
http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale

A client write goes to the primary OSD/PG and will not be ACK'ed to the client until it has reached all replica OSDs.
This happens while the data is in-flight (in RAM); it is not read back from the journal or filestore.
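
A deliberately simplified sketch of that write/ACK ordering (it ignores journaling, peering and failure handling; the class and names here are purely illustrative):

class OSD:
    """Toy stand-in for an OSD; it just stores objects in a dict."""
    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, oid, data):
        self.store[oid] = data
        return True                      # the replica's ack

def client_write(primary, replicas, oid, data):
    # The in-flight client data is what gets forwarded to the replicas,
    # not whatever the primary might later read back from disk.
    acks = [primary.write(oid, data)]
    acks += [osd.write(oid, data) for osd in replicas]
    if all(acks):
        return "ACK to client"           # only after every replica confirmed

primary, r1, r2 = OSD("osd.0"), OSD("osd.1"), OSD("osd.2")
print(client_write(primary, [r1, r2], "obj1", b"payload"))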

What would be sent to the replica, the original data or the saved one?

When bit rot happens I'll have 1 corrupted object and 2 good ones.
How do you manage this between deep scrubs? Which data would be used by Ceph? I think that bit rot on a huge VM block device could lead to a mess, like the whole device being corrupted.
Would a VM affected by bit rot be able to stay up and running?
And what about bit rot on a qcow2 file?

Bit rot is a bit hyped; I haven't seen any on the Ceph clusters I run, nor on other systems here where I (can) actually check for it.

As to how it would affect things, that very much depends.

If it's something like a busy directory inode that gets corrupted, the data in question will be in RAM (SLAB) and the next update will correct things.

If it's a logfile, you're likely to never notice until deep-scrub
detects it eventually.
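
If waiting for the periodic deep-scrub isn't acceptable, scrubs can also be kicked off by hand. A hedged sketch using the standard ceph CLI via Python (the PG id is a placeholder and the health-output parsing is deliberately naive):

import subprocess

pgid = "2.1f"    # placeholder -- pick a real PG id from `ceph pg dump`

# Ask the OSDs to deep-scrub this PG now instead of waiting for the schedule.
subprocess.check_call(["ceph", "pg", "deep-scrub", pgid])

# Once the scrub has finished, inconsistencies show up in the health output.
health = subprocess.check_output(["ceph", "health", "detail"]).decode()
if f"pg {pgid} is" in health and "inconsistent" in health:
    # On filestore, repair copies the primary's version over the replicas,
    # which is exactly the "no checksums" limitation discussed in this thread.
    subprocess.check_call(["ceph", "pg", "repair", pgid])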

This isn't a Ceph-specific question; on all systems that aren't backed by something like ZFS or BTRFS you're potentially vulnerable to this.

Of course, if you're that worried, you could always run BTRFS or ZFS
inside your VM and notice immediately when something goes wrong.
I personally wouldn't though, due to the performance penalties involved
(CoW).


Let me try to explain: when writing to the primary PG I want to write bit "1", but due to bit rot "0" gets saved instead.
Would Ceph read the written bit and spread that across the cluster (so it will spread "0"), or spread the in-memory value "1"?

What if the journal fails during a read or a write?
Again, you may want to get a deeper understanding of Ceph.
The journal isn't involved in reads.

Is Ceph able to recover by removing that journal from the affected OSD (and still running at lower speed), or should I use RAID1 on the SSDs used for the journal?

Neither; a journal failure is lethal for the OSD involved, and unless you have LOTS of money, RAID1 SSDs are a waste.

If you use DC-level SSDs with sufficient endurance (TBW), a failing SSD is a very unlikely event.
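
For a sense of what "sufficient endurance" works out to, a rough estimate (all workload figures below are assumptions for illustration, not sizing advice):

# Back-of-the-envelope journal SSD endurance check.
osds_per_ssd = 4          # OSD journals sharing one SSD (assumed)
write_mb_s = 10           # sustained client write rate per OSD (assumed)
years = 5                 # intended service life
rated_tbw = 8300          # vendor endurance rating of the SSD in TB (assumed)

# Every client write passes through the journal once, so the SSD absorbs
# the combined write stream of all OSDs behind it.
tb_written = osds_per_ssd * write_mb_s * 3600 * 24 * 365 * years / 1e6

print(f"Expected journal writes over {years} years: ~{tb_written:,.0f} TB")
print("OK" if tb_written < rated_tbw else "Pick a higher-endurance SSD")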

Additionally your cluster should (NEEDS to) be designed to handle the
loss of a journal SSD and its associated OSDs, since that is less than
a whole node, or a whole rack (whatever your failure domain may be).
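
One way to sanity-check that is a quick free-space estimate: after losing a journal SSD and the OSDs behind it, the surviving OSDs have to absorb the re-replicated data and still stay below the near-full threshold (node, OSD and utilisation figures are assumptions):

# Headroom check after losing one journal SSD and its OSDs.
nodes = 5
osds_per_node = 8
osd_size_tb = 4.0
osds_per_journal_ssd = 4   # OSDs taken down by one failed journal SSD
current_fill = 0.60        # average OSD utilisation before the failure
nearfull = 0.85            # Ceph's default near-full ratio

total_osds = nodes * osds_per_node
data_tb = total_osds * osd_size_tb * current_fill
surviving_tb = (total_osds - osds_per_journal_ssd) * osd_size_tb

fill_after = data_tb / surviving_tb
print(f"Average fill after backfill completes: {fill_after:.0%}")
print("OK" if fill_after < nearfull else "Not enough headroom")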

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


