Re: Storage, File Systems and Data Scrubbing

ker can <kercan74@xxxxxxxxx> · Tue, 27 Aug 2013 10:09:01 -0500

This was very helpful, thanks. However, I'm still trying to reconcile this with something that Sage mentioned a while back on a similar topic. Apparently you can disable the journal if you're using btrfs. Is that possible because btrfs takes care of things like atomic object writes and updates to the OSD metadata?

-----Original Message-----

From: ceph-users-bounces@xxxxxxxxxxxxxx
[mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sage Weil

Sent: Thursday, July 11, 2013 8:39 PM

To: Mark Nelson

Cc: ceph-users@xxxxxxxxxxxxxx

Subject: Re:  Turning off ceph journaling with xfs ?

Note that you *can* disable the journal if you use btrfs, but your write
latency will tend to be pretty terrible.  This is only viable for
bulk-storage use cases where throughput trumps all and latency is not an
issue at all (it may be seconds).

We are planning on eliminating the double-write for at least large writes
when using btrfs by cloning data out of the journal and into the target
file.  This is not a hugely complex task (although it is non-trivial), but
it hasn't made it to the top of the priority list yet.

sage

On Mon, Aug 26, 2013 at 4:05 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:

ceph-osd builds a transactional interface on top of the usual POSIX
operations so that we can do things like atomically perform an object
write and update the osd metadata.  The current implementation
requires our own journal and some metadata ordering (which is provided
by the backing filesystem's own journal) to implement our own atomic
operations.  It's true that in some cases you might be able to get
away with having the client replay the operation (which we do anyway
for other reasons), but that wouldn't be enough to ensure consistency
of the filesystem's own internal structures.  It also wouldn't be
enough to ensure that the OSD's internal structures remain consistent
in the case of a crash.  Also, if the client is unavailable to do the
replay, you'd have a problem.

In summary, it's actually really hard to detect partial/corrupted
writes after a crash without journaling of some form.

-Sam
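
For illustration only, here is a minimal Python sketch of the general write-ahead journaling idea described above. It is not Ceph's actual implementation. The function names (journal_append, journal_replay, apply_txn), the record layout (length plus CRC32 header) and the ".data"/".meta" file naming are all assumptions made up for this example. The point is that a transaction covering both the object write and the metadata update is made durable in the journal first, so that after a crash a torn or corrupted tail record can be detected and the complete records replayed.

```python
import json
import os
import struct
import zlib

# Hypothetical record header: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def journal_append(journal_path, txn):
    """Durably record a transaction (a dict) before it is applied."""
    payload = json.dumps(txn).encode("utf-8")
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    with open(journal_path, "ab") as j:
        j.write(record)
        j.flush()
        os.fsync(j.fileno())  # record must reach disk before we apply it


def journal_replay(journal_path):
    """Return all complete transactions; stop at the first torn or corrupt record."""
    txns = []
    if not os.path.exists(journal_path):
        return txns
    with open(journal_path, "rb") as j:
        data = j.read()
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # partial or corrupted write detected
        txns.append(json.loads(payload.decode("utf-8")))
        pos = start + length
    return txns


def apply_txn(store_dir, txn):
    """Apply object data and metadata; repeating this is harmless (idempotent)."""
    parts = (("data", txn["data"]), ("meta", json.dumps(txn["meta"])))
    for suffix, content in parts:
        path = os.path.join(store_dir, txn["object"] + "." + suffix)
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX filesystems
```

On restart, the sketch would call journal_replay() and re-run apply_txn() for every transaction it returns. Because each apply is idempotent, it does not matter whether the crash happened before, during or after the original apply. Note the cost this implies: every write hits the disk twice (journal, then target file), which is the double-write mentioned in Sage's message.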
