Re: [ceph-users] Deprecating ext4 support


On Tue, 12 Apr 2016, Jan Schermer wrote:
> Who needs to have exactly the same data in two separate objects 
> (replicas)? Ceph needs it because "consistency"?, but the app (VM 
> filesystem) is fine with whatever version because the flush didn't 
> happen (if it did the contents would be the same).

While we're talking/thinking about this, here's a simple example of why 
the obvious approach (letting the replicas stay out of sync), which seems 
reasonable at first, can blow up in your face.

If a disk block contains A and you write B over the top of it and then 
there is a failure (e.g. power loss before you issue a flush), it's okay 
for the disk to contain either A or B.  In a replicated system, let's say 
2x mirroring (call them R1 and R2), you might end up with B on R1 and A 
on R2.  If you don't immediately clean it up, then at some point down the 
line you might switch from reading R1 to reading R2 and the disk block 
will go "back in time" (previously you read B, now you read A).  A 
single disk/replica will never do that, and applications can break.
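
To make the "back in time" read concrete, here is a minimal toy sketch in 
Python.  It is not Ceph code: the MirroredBlock class and its "reached" 
parameter are hypothetical names made up for this illustration, and the 
model assumes a write can land on one replica but not the other.

```python
class MirroredBlock:
    """Toy model of one disk block under 2x mirroring (hypothetical)."""

    def __init__(self, value):
        self.replicas = {"R1": value, "R2": value}

    def write(self, value, reached=("R1", "R2")):
        # 'reached' models which replicas received the write before a
        # failure (e.g. power loss before a flush was issued).
        for name in reached:
            self.replicas[name] = value

    def read(self, source):
        return self.replicas[source]


block = MirroredBlock("A")
block.write("B", reached=("R1",))  # failure: only R1 gets B

# Reads are served from R1 first, then later switch to R2.
history = [block.read("R1"), block.read("R2")]
print(history)  # ['B', 'A']  the block went "back in time"
```

A single disk would return either A every time or B every time; it is the 
B-then-A sequence that applications are not prepared for.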

For example, if the block in question is a journal block, we might see B 
the first time (valid journal!), then do a bunch of work and 
journal/write new stuff to the blocks that follow.  Then we lose 
power again, lose R1, replay the journal, read A from R2, and stop journal 
replay early... missing out on all the new stuff.  This can easily corrupt 
a file system or database or whatever else.
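
The journal scenario can be sketched the same way.  This is a rough 
illustration under stated assumptions, not how any particular file system 
lays out its journal: the replay() helper and the (sequence, payload) 
record layout are hypothetical, and validity is modeled as "the sequence 
number is the next one expected".

```python
def replay(journal):
    """Return payloads of records up to the first stale or empty block."""
    applied = []
    expected_seq = 1
    for record in journal:
        if record is None or record[0] != expected_seq:
            break  # stale or missing record: stop replay here
        applied.append(record[1])
        expected_seq += 1
    return applied


STALE = (0, "A")  # old contents of the first journal block

r1 = [STALE, None, None]
r2 = [STALE, None, None]

# Write B over A; power is lost and only R1 has the new block.
r1[0] = (1, "B")

# First recovery happens to read R1: the journal looks valid.
first = replay(r1)

# More work is journaled to the following blocks on both replicas.
for replica in (r1, r2):
    replica[1] = (2, "C")
    replica[2] = (3, "D")

# Second power loss, and R1 is gone.  Replay now reads R2.
second = replay(r2)

print(first)   # ['B']
print(second)  # []  replay stops at A, so C and D are never applied
```

C and D are intact on R2, but replay never reaches them because the block 
in front of them reverted to A.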

It might sound unlikely, but keep in mind that writes to these 
all-important metadata and commit blocks are extremely frequent.  It's the 
kind of thing you can usually get away with, until you don't, and then you 
have a very bad day...

sage


