Re: Disk failures


Hi,
bit rot is not "bit rot" per se - nothing is rotting on the drive platter. It occurs during reads (mostly, anyway), and it's random.
You can happily read a block and get the correct data, then read it again and get garbage, then get correct data again.
This could be caused by a worn-out cell on an SSD, but firmware looks for that and rewrites the cell if the signal is attenuated too much.
On spinners there are no cells to refresh, so rewriting doesn't help there either.

You can't really "look for" bit rot, for the reasons above; strong checksumming/hash verification during reads is the only solution.
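
To make that concrete, here is a minimal Python sketch of checksum-verified reads. It is not how Ceph or any particular filesystem does it; `VerifiedStore`, `ChecksumError` and the dict-like backend are made-up names for illustration, and the retry on mismatch is my own assumption, based on a bad read often being transient.

```python
import hashlib


class ChecksumError(OSError):
    """Raised when data read back does not match its stored digest."""


class VerifiedStore:
    """Toy block store that keeps a SHA-256 digest for every block written."""

    def __init__(self, backend):
        # backend: any dict-like object mapping key -> bytes (hypothetical).
        self._backend = backend
        self._digests = {}

    def write(self, key, data):
        # Record the digest of what we intended to write, then store the data.
        self._digests[key] = hashlib.sha256(data).digest()
        self._backend[key] = data

    def read(self, key, retries=2):
        # Verify every read; re-read on mismatch since the error may be transient.
        for _ in range(retries + 1):
            data = self._backend[key]
            if hashlib.sha256(data).digest() == self._digests[key]:
                return data
        raise ChecksumError(f"checksum mismatch for block {key!r}")
```

If the data is still wrong after the retries, the caller gets an error instead of garbage, which is the whole point.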

And trust me, bit rot is a very real thing and very dangerous as well - do you think companies like Seagate or WD would publish bit error rates (BER) for their drives if it weren't real?
I'd buy a drive with a BER of 1 in 10^999 over one with 1 in 10^14, wouldn't everyone?
And it is especially dangerous when something like Ceph handles much larger blocks of data than the client does.
While the client (or an app) has some knowledge of the data _and_ hopefully throws an error if it reads garbage, Ceph will (if, for example, snapshots
are used and FIEMAP is off) actually have to read the whole object (say 4MiB) and write it elsewhere, without any knowledge of whether what it read (and wrote) made any sense to the app.
This way corruption might spread silently into your backups if you don't validate the data somehow (or dump it from a database for example, where it's likely to get detected).
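
To put a number on the BER figures above, here is a rough back-of-the-envelope sketch in Python. It assumes the spec-sheet figure means one unrecoverable read error per 10^14 bits read and that errors are independent, which real drives won't strictly obey. The function name and the amounts of data read are just examples.

```python
import math


def p_at_least_one_error(bytes_read, bits_per_error=1e14):
    """Probability of >= 1 unrecoverable read error, assuming independent bit errors."""
    bits = bytes_read * 8
    p_bit = 1.0 / bits_per_error
    # 1 - (1 - p_bit)^bits, computed in a numerically stable way for tiny p_bit.
    return -math.expm1(bits * math.log1p(-p_bit))


# Example amounts of data read in one pass (decimal terabytes, assumed values).
for tb in (1, 4, 12):
    p = p_at_least_one_error(tb * 1e12)
    print(f"{tb:>2} TB read: P(at least one error) ~ {p:.1%}")
```

Under those assumptions, reading a few TB end to end already gives a chance of hitting at least one bad read that is far from negligible, and that is per pass over the data.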

Btw just because you think you haven't seen it doesn't mean you haven't seen it - never seen artefacting in movies? Just a random bug in the decoder, is it? VoD guys would tell you...

For things like databases this is somewhat less impactful - bit rot doesn't "flip a bit" but affects larger blocks of data (like one sector), so databases usually catch this during the read and
return an error instead of returning garbage to the client.

Jan



> On 09 Jun 2016, at 09:16, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> 
> Hello,
> 
> On Thu, 9 Jun 2016 08:43:23 +0200 Gandalf Corvotempesta wrote:
> 
>> On 09 Jun 2016 02:09, "Christian Balzer" <chibi@xxxxxxx> wrote:
>>> Ceph currently doesn't do any (relevant) checksumming at all, so if a
>>> PRIMARY PG suffers from bit-rot this will be undetected until the next
>>> deep-scrub.
>>> 
>>> This is one of the longest and gravest outstanding issues with Ceph and
>>> supposed to be addressed with bluestore (which currently doesn't have
>>> checksum verified reads either).
>> 
>> So if bit rot happens on primary PG, ceph is spreading the corrupted data
>> across the cluster?
> No.
> 
> You will want to re-read the Ceph docs and the countless posts here about
> how replication within Ceph works.
> http://docs.ceph.com/docs/hammer/architecture/#smart-daemons-enable-hyperscale
> 
> A client write goes to the primary OSD/PG and will not be ACK'ed to the
> client until it has reached all replica OSDs.
> This happens while the data is in-flight (in RAM), it's not read from the
> journal or filestore.
> 
>> What would be sent to the replica,  the original data or the saved one?
>> 
>> When bit rot happens I'll have 1 corrupted object and 2 good.
>> how do you manage this between deep scrubs?  Which data would be used by
>> ceph? I think that a bitrot on a huge VM block device could lead to a
>> mess like the whole device corrupted
>> VM affected by bitrot would be able to stay up and running?
>> And bitrot on a qcow2 file?
>> 
> Bitrot is a bit hyped, I haven't seen any on the Ceph clusters I run nor
> on other systems here where I (can) actually check for it.
> 
> As to how it would affect things, that very much depends.
> 
> If it's something like a busy directory inode that gets corrupted, the data
> in question will be in RAM (SLAB) and the next update  will correct things.
> 
> If it's a logfile, you're likely to never notice until deep-scrub detects
> it eventually.
> 
> This isn't a  Ceph specific question, on all systems that aren't backed
> by something like ZFS or BTRFS you're potentially vulnerable to this.
> 
> Of course if you're that worried, you could always run BTRFS or ZFS inside
> your VM and notice immediately when something goes wrong.
> I personally wouldn't though, due to the performance penalties involved
> (CoW).
> 
> 
>> Let me try to explain: when writing to primary PG i have to write bit "1"
>> Due to a bit rot, I'm saving "0".
>> Would ceph read the written bit and spread that across the cluster (so it
>> will spread "0") or spread the in memory value "1" ?
>> 
>> What if the journal fails during a read or a write? 
> Again, you may want to get a deeper understanding of Ceph.
> The journal isn't involved in reads.
> 
>> Ceph is able to
>> recover by removing that journal from the affected osd (and still
>> running at lower speed) or should i use a raid1 on ssds used by journal ?
>> 
> Neither, a journal failure is lethal for the OSD involved and unless you
> have LOTS of money RAID1 SSDs are a waste.
> 
> If you use DC level SSDs with sufficient endurance (TBW) a failing SSD is
> a very unlikely event.
> 
> Additionally your cluster should (NEEDS to) be designed to handle the
> loss of a journal SSD and its associated OSDs, since that is less than a
> whole node, or a whole rack (whatever your failure domain may be).
> 
> Christian
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
