Re: Disk failures

Hello,

On Tue, 14 Jun 2016 14:26:41 +0200 Jan Schermer wrote:

> Hi,
> bit rot is not "bit rot" per se - nothing is rotting on the drive
> platter. 

Never mind that I used the wrong terminology (according to Wikipedia) and
that my long experience with "laser rot" probably caused me to choose that
term; there are data degradation scenarios that are caused by undetected
media failures or by corruption happening in the write path, which makes
them quite reproducible.

> It occurs during reads (mostly, anyway), and it's random. You
> can happily read a block and get the correct data, then read it again
> and get garbage, then get correct data again. This could be caused by a
> worn-out cell on an SSD, but firmwares look for that and rewrite it if the
> signal is attenuated too much. On spinners there are no cells to
> refresh, so rewriting doesn't help either.
> 
> You can't really "look for" bit rot due to the reasons above, strong
> checksumming/hash verification during reads is the only solution.
> 
Which is what I've been saying in the mail below and for years on this ML.

And that makes deep-scrubbing of rather limited value.
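
To make the point concrete, here is a minimal sketch of what checksum-on-read
looks like at the application level, assuming the python-rados bindings and an
example pool named "rbd" (nothing Ceph does for you on plain filestore reads,
which is exactly the problem):

    import hashlib
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')   # example pool name

    name = 'demo-object'
    payload = b'data the application actually cares about'

    # Store the object together with a digest of what we *meant* to write.
    ioctx.write_full(name, payload)
    ioctx.set_xattr(name, 'sha256', hashlib.sha256(payload).hexdigest().encode())

    # On every read, verify against the stored digest instead of trusting the OSD.
    data = ioctx.read(name, length=4 * 1024 * 1024)
    stored = ioctx.get_xattr(name, 'sha256').decode()
    if hashlib.sha256(data).hexdigest() != stored:
        raise IOError('checksum mismatch on %s - silent corruption?' % name)

    ioctx.close()
    cluster.shutdown()

The stored digest can of course rot as well, but then the mismatch still tells
you that something is wrong.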

Christian
> And trust me, bit rot is a very real thing and very dangerous as well -
> do you think companies like Seagate or WD would lie about bit rot if
> it's not real? I'd buy a drive with BER 10^999 over one with 10^14,
> wouldn't everyone? And it is especially dangerous when something like
> Ceph handles much larger blocks of data than the client does. While the
> client (or an app) has some knowledge of the data _and_ hopefully throws
> an error if it read garbage, Ceph will (if for example snapshots are
> used and FIEMAP is off) actually have to read the whole object (say
> 4MiB) and write it elsewhere, without any knowledge of whether what it read
> (and wrote) made any sense to the app. This way corruption might spread
> silently into your backups if you don't validate the data somehow (or
> dump it from a database for example, where it's likely to get detected).
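
For the numbers-minded, that 10^14 figure is easy to put into perspective with
a back-of-the-envelope calculation (the 10 TB below is just an example
workload, say one scrub pass over a small pool):

    import math

    ber = 1e-14                 # datasheet: one unrecoverable error per 1e14 bits read
    bytes_read = 10e12          # example: 10 TB read in one pass

    expected_errors = bytes_read * 8 * ber           # ~0.8 expected errors
    p_at_least_one = 1 - math.exp(-expected_errors)  # ~55% chance of hitting one
    print(expected_errors, round(p_at_least_one, 2))
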
> 
> Btw, just because you think you haven't seen it doesn't mean you haven't
> seen it: never seen artefacting in movies? Just a random bug in the
> decoder, is it? The VoD guys would tell you...
> 
> For things like databases this is somewhat less impactful - bit rot
> doesn't "flip a bit" but affects larger blocks of data (like one
> sector), so databases usually catch this during read and err instead of
> returning garbage to the client.
> 
> Jan
> 
> 
> 
> > On 09 Jun 2016, at 09:16, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:
> > 
> >> On 09 Jun 2016 02:09, "Christian Balzer" <chibi@xxxxxxx> wrote:
> >>> Ceph currently doesn't do any (relevant) checksumming at all, so if a
> >>> PRIMARY PG suffers from bit-rot this will be undetected until the
> >>> next deep-scrub.
> >>> 
> >>> This is one of the longest and gravest outstanding issues with Ceph
> >>> and supposed to be addressed with bluestore (which currently doesn't
> >>> have checksum verified reads either).
> >> 
> >> So if bit rot happens on the primary PG, Ceph is spreading the corrupted
> >> data across the cluster?
> > No.
> > 
> > You will want to re-read the Ceph docs and the countless posts here
> > about how replication within Ceph works.
> > http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
> > 
> > A client write goes to the primary OSD/PG and will not be ACK'ed to the
> > client until it has reached all replica OSDs.
> > This happens while the data is in-flight (in RAM), it's not read from
> > the journal or filestore.
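
To illustrate that with a toy model (this is not Ceph code, just the ack flow
described above): the primary forwards the in-flight payload it received from
the client, so whatever may have rotted on its disk never enters the write
path.

    class ToyOSD(object):
        def __init__(self):
            self.store = {}                 # stands in for journal + filestore

        def write(self, name, payload):
            self.store[name] = payload      # this copy may rot on disk later...
            return True                     # ...but the ack covers this write only

    def client_write(name, payload, primary, replicas):
        ok = primary.write(name, payload)
        # Replicas receive the same in-memory payload, never a re-read of the
        # primary's disk, so on-disk corruption is not replicated here.
        ok = ok and all(r.write(name, payload) for r in replicas)
        return ok                           # ACK to the client only if all stored it

    osds = [ToyOSD() for _ in range(3)]
    print(client_write('obj1', b'\x01', osds[0], osds[1:]))   # True
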
> > 
> >> What would be sent to the replica: the original data or the saved
> >> one?
> >> 
> >> When bit rot happens I'll have 1 corrupted object and 2 good ones.
> >> How do you manage this between deep scrubs? Which data would be used
> >> by Ceph? I think that bit rot on a huge VM block device could lead
> >> to a mess, with the whole device corrupted.
> >> Would a VM affected by bit rot be able to stay up and running?
> >> And what about bit rot on a qcow2 file?
> >> 
> > Bit rot is a bit hyped; I haven't seen any on the Ceph clusters I run,
> > nor on other systems here where I (can) actually check for it.
> > 
> > As to how it would affect things, that very much depends.
> > 
> > If it's something like a busy directory inode that gets corrupted, the
> > data in question will be in RAM (SLAB) and the next update will
> > correct things.
> > 
> > If it's a logfile, you're likely to never notice until deep-scrub
> > detects it eventually.
> > 
> > This isn't a Ceph-specific question; on all systems that aren't backed
> > by something like ZFS or BTRFS you're potentially vulnerable to this.
> > 
> > Of course if you're that worried, you could always run BTRFS or ZFS
> > inside your VM and notice immediately when something goes wrong.
> > I personally wouldn't though, due to the performance penalties involved
> > (CoW).
> > 
> > 
> >> Let me try to explain: when writing to the primary PG I have to write
> >> bit "1", but due to bit rot "0" gets saved instead.
> >> Would Ceph read the written bit and spread that across the cluster (so
> >> it would spread "0"), or spread the in-memory value "1"?
> >> 
> >> What if the journal fails during a read or a write? 
> > Again, you may want to get a deeper understanding of Ceph.
> > The journal isn't involved in reads.
> > 
> >> Is Ceph able to
> >> recover by removing that journal from the affected OSD (and still
> >> running at lower speed), or should I use RAID1 on the SSDs used for the
> >> journal?
> >> 
> > Neither; a journal failure is lethal for the OSD involved, and unless
> > you have LOTS of money, RAID1 SSDs are a waste.
> > 
> > If you use DC-level SSDs with sufficient endurance (TBW), a failing SSD
> > is a very unlikely event.
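
A quick way to sanity-check "sufficient endurance" against your own write
load; the rating and write rate below are made-up example numbers, not a
recommendation:

    tbw_rating = 3000.0     # hypothetical DC SSD rated for 3000 TB written
    journal_mb_s = 20.0     # average client write rate hitting this journal SSD

    tb_per_day = journal_mb_s * 86400 / 1e6   # ~1.7 TB/day
    print(tbw_rating / tb_per_day / 365)      # ~4.8 years to exhaust the rating
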
> > 
> > Additionally, your cluster should (NEEDS to) be designed to handle the
> > loss of a journal SSD and its associated OSDs, since that is less than
> > the loss of a whole node or a whole rack (whatever your failure domain
> > may be).
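
And a sketch of what that design constraint means in numbers (again, example
values only); besides the backfill time, the surviving OSDs also need enough
free capacity to absorb that data:

    osds_per_journal = 4        # OSDs sharing the failed journal SSD
    osd_size_tb = 4.0           # raw capacity per OSD
    fill_ratio = 0.6            # how full those OSDs are
    recovery_mb_s = 1000.0      # aggregate backfill throughput the cluster sustains

    data_tb = osds_per_journal * osd_size_tb * fill_ratio
    hours = data_tb * 1e6 / recovery_mb_s / 3600
    print(data_tb, round(hours, 1))   # ~9.6 TB to re-replicate, ~2.7 hours of backfill
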
> > 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


