It doesn't seem like it would be wise to run such systems on top of rbd.
-Sam

On Thu, Apr 14, 2016 at 11:05 AM, Jianjian Huo <samuel.huo@xxxxxxxxx> wrote:
> On Wed, Apr 13, 2016 at 6:06 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Who needs to have exactly the same data in two separate objects
>>> (replicas)? Ceph needs it because "consistency"?, but the app (VM
>>> filesystem) is fine with whatever version because the flush didn't
>>> happen (if it did the contents would be the same).
>>
>> While we're talking/thinking about this, here's a simple example of why
>> the simple solution (let the replicas be out of sync), which seems
>> reasonable at first, can blow up in your face.
>>
>> If a disk block contains A and you write B over the top of it and then
>> there is a failure (e.g. power loss before you issue a flush), it's okay
>> for the disk to contain either A or B. In a replicated system, let's say
>> 2x mirroring (call them R1 and R2), you might end up with B on R1 and A
>> on R2. If you don't immediately clean it up, then at some point down the
>> line you might switch from reading R1 to reading R2 and the disk block
>> will go "back in time" (previously you read B, now you read A). A
>> single disk/replica will never do that, and applications can break.
>>
>> For example, if the block in question is a journal block, we might see B
>> the first time (valid journal!), then do a bunch of work and
>> journal/write new stuff to the blocks that follow. Then we lose
>> power again, lose R1, replay the journal, read A from R2, and stop journal
>> replay early... missing out on all the new stuff. This can easily corrupt
>> a file system or database or whatever else.
>
> If data is critical, applications use their own replicas: MySQL,
> Cassandra, MongoDB... If the above scenario happens and one replica is out
> of sync, they use a quorum-like protocol to guarantee reading the latest
> data, and repair those out-of-sync replicas.
> So eventual consistency in storage is acceptable for them?
>
> Jianjian
>>
>> It might sound unlikely, but keep in mind that writes to these
>> all-important metadata and commit blocks are extremely frequent. It's the
>> kind of thing you can usually get away with, until you don't, and then you
>> have a very bad day...
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
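
For anyone who wants to play with Sage's journal example, here is a rough sketch of it as a toy simulation. Everything in it is an assumption made up for illustration: the journal is a list of slots holding (sequence number, payload) pairs, `replay` and `run_scenario` are hypothetical names, and none of this reflects how Ceph, rbd, or any real file system lays out or replays its journal. It only shows how a stale block on R2 can cut replay short.

```python
def replay(journal):
    """Replay a toy journal: apply entries in order and stop at the first
    slot that is empty or carries an unexpected sequence number."""
    applied = []
    for expected_seq, entry in enumerate(journal):
        if entry is None or entry[0] != expected_seq:
            break  # looks like the end of the valid journal
        applied.append(entry[1])
    return applied


def run_scenario():
    # Two mirrored replicas of a 4-slot journal (hypothetical layout).
    r1 = [(0, "txn0"), None, None, None]
    r2 = [(0, "txn0"), None, None, None]

    # Write B into slot 1. Power is lost before the flush, so B lands
    # on R1 only; R2 still holds the old contents A (an empty slot here).
    r1[1] = (1, "txn1")

    # After restart we happen to read R1 and see a valid journal up to slot 1.
    first_replay = replay(r1)

    # We carry on and journal new work in the following slots. These
    # writes are flushed, so both replicas receive them.
    for replica in (r1, r2):
        replica[2] = (2, "txn2")
        replica[3] = (3, "txn3")

    # Power is lost again and R1 dies. Replay now reads R2, hits the
    # stale slot 1, and stops early, skipping txn2 and txn3.
    second_replay = replay(r2)
    return first_replay, second_replay


if __name__ == "__main__":
    first, second = run_scenario()
    print("replay from R1:", first)
    print("replay from R2:", second)
```

A single disk would have given either A or B for slot 1 on every read, so replay would have been consistent both times; the problem only appears because the two replicas disagree and the reader switches between them.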