Re: Storage, File Systems and Data Scrubbing

Sage Weil <sage@xxxxxxxxxxx> · Tue, 27 Aug 2013 09:26:00 -0700 (PDT)

On Tue, 27 Aug 2013, ker can wrote:
> This was very helpful, thanks.  However, I'm still trying to reconcile this
> with something that Sage mentioned a while back on a similar topic.
> Apparently you can disable the journal if you're using btrfs.  Is that
> possible because btrfs takes care of things like atomic object writes and
> updates to the osd metadata?

It's because with btrfs we take snapshots that are consistent checkpoints.
You *can* disable the journal, but it means that writes only commit when
a new checkpoint (i.e., a snapshot) is made, which is an infrequent and
relatively expensive operation, so in general the write latency is
terrible.  This is useful only for workloads such as bulk data injection,
where write latency is not important.
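
As a rough illustration of why the latency gets so bad, here is a toy model
(not Ceph code). It assumes checkpoints happen at a fixed period and that a
journal flush takes a fixed time; the function name, the 5 second interval
and the 5 ms journal latency are all made-up values for the sketch.

```python
# Toy model, not Ceph code: when does a write count as committed,
# with a journal versus with checkpoint-only commits?

def commit_latencies(write_times, checkpoint_interval, journal_latency=None):
    """Return the commit latency in seconds for each write.

    write_times: times (seconds) at which writes arrive.
    checkpoint_interval: seconds between snapshots (assumed periodic).
    journal_latency: if given, every write commits after this fixed
        journal flush time; if None, a write commits only when the
        next checkpoint is taken.
    """
    latencies = []
    for t in write_times:
        if journal_latency is not None:
            latencies.append(journal_latency)
        else:
            # First checkpoint strictly after the write arrives.
            next_checkpoint = (int(t // checkpoint_interval) + 1) * checkpoint_interval
            latencies.append(next_checkpoint - t)
    return latencies


writes = [0.5, 1.0, 4.0]
with_journal = commit_latencies(writes, checkpoint_interval=5, journal_latency=0.005)
without_journal = commit_latencies(writes, checkpoint_interval=5)
```

In this model every journaled write commits in 5 ms, while without the
journal the same writes wait 4.5, 4.0 and 1.0 seconds for the next
checkpoint.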

sage

> 
> 
> -----Original Message-----
> From: ceph-users-bounces@xxxxxxxxxxxxxx
> [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Thursday, July 11, 2013 8:39 PM
> To: Mark Nelson
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Turning off ceph journaling with xfs ?
> 
>  
> 
> Note that you *can* disable the journal if you use btrfs, but your write
> latency will tend to be pretty terrible.  This is only viable for
> bulk-storage use cases where throughput trumps all and latency is not an
> issue at all (it may be seconds).
> 
>  
> 
> We are planning on eliminating the double-write for at least large writes
> when using btrfs by cloning data out of the journal and into the target
> file.  This is not a hugely complex task (although it is non-trivial) but it
> hasn't made it to the top of the priority list yet.
> 
>  
> 
> sage
> 
> 
> 
> On Mon, Aug 26, 2013 at 4:05 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>       ceph-osd builds a transactional interface on top of the usual posix
>       operations so that we can do things like atomically perform an object
>       write and update the osd metadata.  The current implementation
>       requires our own journal and some metadata ordering (which is provided
>       by the backing filesystem's own journal) to implement our own atomic
>       operations.  It's true that in some cases you might be able to get
>       away with having the client replay the operation (which we do anyway
>       for other reasons), but that wouldn't be enough to ensure consistency
>       of the filesystem's own internal structures.  It also wouldn't be
>       enough to ensure that the OSD's internal structures remain consistent
>       in the case of a crash.  Also, if the client is unavailable to do the
>       replay, you'd have a problem.
> 
>       In summary, it's actually really hard to detect partial/corrupted
>       writes after a crash without journaling of some form.
>       -Sam
> 
> 
> 
> 
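The transactional interface Sam describes in the quoted message can be
sketched with a toy write-ahead journal. This is a simplified in-memory
model, not ceph-osd code: the ToyStore class, its method names and the
crash_after parameter are hypothetical, and journal trimming is left out.
The idea shown is only that the whole transaction is recorded before any
part of it is applied, so a crash between the object write and the metadata
update can be repaired by replaying the journal.

```python
import json


class ToyStore:
    """Hypothetical in-memory model of a journaled object store."""

    def __init__(self):
        self.journal = []   # stands in for the durable journal
        self.objects = {}   # stands in for object data on the filesystem
        self.metadata = {}  # stands in for the OSD metadata

    def submit(self, txn, crash_after=None):
        """Journal the whole transaction, then apply its ops in order.

        crash_after simulates a crash once that many ops were applied.
        """
        self.journal.append(json.dumps(txn))  # the commit point
        for i, op in enumerate(txn):
            if crash_after is not None and i == crash_after:
                return False  # simulated crash mid-transaction
            self._apply(op)
        return True

    def _apply(self, op):
        target = self.objects if op["kind"] == "data" else self.metadata
        target[op["key"]] = op["value"]  # idempotent, so replay is safe

    def replay(self):
        """After a restart, re-apply every journaled transaction in order."""
        for entry in self.journal:
            for op in json.loads(entry):
                self._apply(op)


store = ToyStore()
txn = [
    {"kind": "data", "key": "obj1", "value": "hello"},
    {"kind": "meta", "key": "obj1.version", "value": 2},
]
store.submit(txn, crash_after=1)  # object written, metadata update lost
partial = dict(store.metadata)    # empty: data and metadata disagree here
store.replay()                    # replay completes the transaction
```

After the simulated crash the object exists but its metadata does not;
replaying the journal brings both back in line, without needing the client
to resend anything.
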
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com