Re: Deprecating ext4 support

Samuel Just <sjust@xxxxxxxxxx> · Thu, 14 Apr 2016 11:30:23 -0700



It doesn't seem like it would be wise to run such systems on top of rbd.
-Sam

On Thu, Apr 14, 2016 at 11:05 AM, Jianjian Huo <samuel.huo@xxxxxxxxx> wrote:
> On Wed, Apr 13, 2016 at 6:06 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> On Tue, 12 Apr 2016, Jan Schermer wrote:
>>> Who needs to have exactly the same data in two separate objects
>>> (replicas)? Ceph needs it for "consistency", but the app (VM
>>> filesystem) is fine with whichever version, because the flush didn't
>>> happen (if it had, the contents would be the same).
>>
>> While we're talking/thinking about this, here's a simple example of why
>> the simple solution (let the replicas be out of sync), which seems
>> reasonable at first, can blow up in your face.
>>
>> If a disk block contains A and you write B over the top of it and then
>> there is a failure (e.g. power loss before you issue a flush), it's okay
>> for the disk to contain either A or B.  In a replicated system, let's say
>> 2x mirroring (call them R1 and R2), you might end up with B on R1 and A
>> on R2.  If you don't immediately clean it up, then at some point down the
>> line you might switch from reading R1 to reading R2 and the disk block
>> will go "back in time" (previously you read B, now you read A).  A
>> single disk/replica will never do that, and applications can break.
>>
>> For example, if the block in question is a journal block, we might see B
>> the first time (valid journal!), then do a bunch of work and
>> journal/write new stuff to the blocks that follow.  Then we lose
>> power again, lose R1, replay the journal, read A from R2, and stop journal
>> replay early... missing out on all the new stuff.  This can easily corrupt
>> a file system or database or whatever else.
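
[Illustration of the journal scenario above: a minimal Python sketch, not
how ext4, Ceph, or any real journal is implemented. The JournalBlock type,
the "sequence number must match" validity check, and the concrete sequence
numbers are all assumptions made for the example.]

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class JournalBlock:
    seq: int       # hypothetical transaction sequence number stored in the block
    payload: str


def replay(blocks: List[Optional[JournalBlock]], first_seq: int) -> List[str]:
    """Replay blocks in order; stop at the first empty or out-of-sequence block."""
    applied = []
    expected = first_seq
    for block in blocks:
        if block is None or block.seq != expected:
            break  # looks like the end of the journal
        applied.append(block.payload)
        expected += 1
    return applied


def simulate() -> Tuple[List[str], List[str]]:
    a = JournalBlock(seq=3, payload="stale txn 3")   # old contents "A"
    b = JournalBlock(seq=10, payload="txn 10")       # new contents "B"

    # First power loss: the write of B reached R1 but not R2.
    r1 = [b, None, None]
    r2 = [a, None, None]

    # Recovery happens to read R1, so the journal looks valid.
    assert replay(r1, first_seq=10) == ["txn 10"]

    # More work is journaled to the following blocks; these writes
    # do reach both replicas.
    for slot, seq in enumerate((11, 12), start=1):
        block = JournalBlock(seq=seq, payload=f"txn {seq}")
        r1[slot] = block
        r2[slot] = block

    seen_from_r1 = replay(r1, first_seq=10)
    # Second power loss, R1 is gone: replay now reads R2, finds A in the
    # first block, and stops before any of the newer entries.
    seen_from_r2 = replay(r2, first_seq=10)
    return seen_from_r1, seen_from_r2


if __name__ == "__main__":
    print(simulate())
```

In this sketch, replaying from R1 applies all three transactions, while
replaying from R2 applies none, even though transactions 11 and 12 are
intact on R2. The single stale block hides everything written after it.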
>
> If data is critical, applications keep their own replicas (MySQL,
> Cassandra, MongoDB...). If the above scenario happens and one replica is
> out of sync, they use a quorum-like protocol to guarantee reading the
> latest data, and repair the out-of-sync replicas. So is eventual
> consistency in storage acceptable for them?
>
> Jianjian
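
[Illustration of the quorum-style read with repair mentioned above: a rough
Python sketch, not the protocol of MySQL, Cassandra, MongoDB, or any other
specific system. The (version, value) tuples, the quorum_read name, and the
in-memory dict replicas are assumptions made for the example. It also
assumes writes reached enough replicas that any r replicas include at least
one up-to-date copy (the usual W + R > N condition).]

```python
from typing import Dict, List, Tuple

Versioned = Tuple[int, str]  # (version, value); a higher version is newer


def quorum_read(replicas: List[Dict[str, Versioned]], key: str, r: int) -> str:
    """Read `key` from the first r replicas, return the newest value,
    and overwrite any stale copy among the replicas that were contacted."""
    if not 1 <= r <= len(replicas):
        raise ValueError("r must be between 1 and the number of replicas")
    contacted = replicas[:r]
    newest = max((rep[key] for rep in contacted), key=lambda v: v[0])
    for rep in contacted:
        if rep[key][0] < newest[0]:
            rep[key] = newest  # read repair
    return newest[1]


if __name__ == "__main__":
    # N = 3 replicas; the write of "B" (version 2) reached two of them.
    replicas = [
        {"blk": (1, "A")},
        {"blk": (2, "B")},
        {"blk": (2, "B")},
    ]
    print(quorum_read(replicas, "blk", r=2))
    print(replicas[0])
```

Here the reader never observes the stale "A", because it compares versions
across a quorum instead of trusting whichever single replica it happens to
reach.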
>>
>> It might sound unlikely, but keep in mind that writes to these
>> all-important metadata and commit blocks are extremely frequent.  It's the
>> kind of thing you can usually get away with, until you don't, and then you
>> have a very bad day...
>>
>> sage
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com