Comments on Ceph.com's blog article 'Ceph's New Monitor Changes'

Last Friday, Florian Haas (CC'ed) commented on Google+ [2] with regard to the Monitor changes blog post [1]. This is a transcription of the resulting thread, in which Greg (CC'ed) also participated. I am now cross-posting it to the list for the benefit of the larger ceph-devel community, who might not stumble upon the post (although it is public and should not require a G+ account).


[1] - http://ceph.com/dev-notes/cephs-new-monitor-changes/
[2] - https://plus.google.com/110443614427234590648/posts/iuxSyCC5aJp


  -Joao


Florian Haas on Mar 8, 2013 wrote:

Good to see more Ceph developers providing their insight on recent
and ongoing codebase changes.

+Joao Eduardo Luis, I have a comment about the transition from the
file-backed mon store to a leveldb k/v store. As the original
reporter of http://tracker.ceph.com/issues/2752, I keep wondering
how best to recover if an issue that simultaneously brings down
all mons were ever to happen again. At the time, +Gregory
Farnum's suggestion for recovery
(http://marc.info/?l=ceph-devel&m=134151077312444&w=2) involved
manually manipulating files in the mon data directory; I wonder
whether the leveldb approach makes this easier or harder.

It tends to be a common source of discomfort among potential Ceph
users that if their mons ever become unrecoverable, it's almost
impossible to recover their data (compare GlusterFS, where you can
always pull data out of the Gluster bricks unharmed, at least as long
as you don't use striped volumes). With a file-backed mon store, I had
hoped that this might eventually tie into btrfs snapshots, such that
you would be able to roll back to a known-good configuration in an
emergency. With the switch to leveldb, I no longer foresee that ever
happening. Mind sharing your thoughts on that?


Gregory Farnum on Mar 8, 2013 wrote:

Actually, LevelDB snapshots just fine.

I'll leave the rest for Joao, or me after I've been awake for more
than a minute. :)


Florian Haas on Mar 8, 2013 wrote:

Thanks. So leveldb updates are always atomic, consistent, isolated
and durable?

Also, do the mons open the leveldb fds with O_SYNC, or do they
periodically call fsync() or fdatasync()?


Gregory Farnum on Mar 8, 2013 wrote:
I'm not entirely sure -- leveldb handles all that stuff on its own
and provides the right guarantees at the interface level, so I assume
it's doing those things internally. ;)
(In particular, it's actually a hierarchy of files, not a single file.)
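For what it's worth, leveldb exposes durability as a per-write option rather than an O_SYNC fd: writes land in an append-only log, which is fsync()ed only when the caller asks for a synchronous write. Here is a minimal, stdlib-only sketch of that pattern -- illustrative only, not leveldb's actual code, and the length-prefixed record format is invented for the example:

```python
import os

def append_record(log_path, payload, sync=True):
    """Append one length-prefixed record to a write-ahead log.

    Mirrors the pattern leveldb uses: the file is opened normally
    (no O_SYNC), and fsync() is issued per write only when the
    caller requests a synchronous, durable write.
    """
    with open(log_path, "ab") as f:
        f.write(len(payload).to_bytes(4, "big") + payload)
        f.flush()                 # push Python's buffer to the kernel
        if sync:
            os.fsync(f.fileno())  # force the kernel's buffer to disk

def read_records(log_path):
    """Read back all complete records, ignoring a torn tail."""
    records = []
    with open(log_path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + 4 <= len(data):
        n = int.from_bytes(data[pos:pos + 4], "big")
        if pos + 4 + n > len(data):
            break                 # partial record: drop it
        records.append(data[pos + 4:pos + 4 + n])
        pos += 4 + n
    return records
```

The point is that the sync/async decision is made per write at the API level, which matches Greg's "leveldb handles all that stuff on its own" above.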


Joao Eduardo Luis on Mar 9, 2013 wrote:

When it comes to manipulating the contents of the store, yeah, it's
harder now: you'll need a tool that speaks leveldb for that. Which IMO
can actually be nice, if we end up with a tool that lets one perform
minor incursions into the store in a somewhat automated fashion (say,
revert to an older osdmap). Creating a tool to change the store's
contents isn't hard to do either; we just didn't get around to putting
it together (there is one to read the store's contents, though, and
adapting it should be easy).

And I can see the discomfort in having the mon store on leveldb instead
of on the FS. I had never thought about backing up the mon store via,
say, btrfs snapshots, as I've been approaching it from a 'distributed
is the path to success' angle and maintaining multiple mons. So I guess
it might also make things harder, if btrfs snapshots (or any other
snapshot tools, for that matter) don't play nice with leveldb (or
vice-versa). I looked on the internets for a potential solution, and it
looks like someone at stackoverflow [1] recommended copying some of
leveldb's files and hard-linking others to achieve that -- it's not a
pretty solution, and it would require the monitor to block access to
the store while performing such an operation.
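The recipe referenced above essentially boils down to: hard-link the immutable table files and copy the small mutable ones, while the store is quiesced. A rough, stdlib-only sketch of that idea follows -- the file-name conventions are assumed from leveldb's on-disk layout, and this is in no way an official backup tool:

```python
import os
import shutil

def checkpoint_store(store_dir, backup_dir):
    """Checkpoint a leveldb-style store directory.

    Table files (.sst/.ldb) are immutable once written, so hard
    links are enough; the log, MANIFEST and CURRENT files still
    mutate and must be copied. The monitor would have to block
    writes to the store for the duration of this.
    """
    os.makedirs(backup_dir, exist_ok=True)
    for name in os.listdir(store_dir):
        src = os.path.join(store_dir, name)
        dst = os.path.join(backup_dir, name)
        if name.endswith((".sst", ".ldb")):
            os.link(src, dst)       # immutable: hard link is safe
        else:
            shutil.copy2(src, dst)  # mutable: take a real copy
```

Hard links make the checkpoint nearly free in space and time for the bulk of the data, which is why the stackoverflow answer reaches for them; the ugly part is the required quiescing.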

Also, it is worth mentioning that leveldb does support snapshots, but
they are all lost once leveldb is closed, so there's no gain from
that support when attempting to create checkpoints (unless there's
something I've missed altogether and this is in fact possible!).
Furthermore, operations are applied in batches (we have a nifty
interface that abstracts them as transactions, but they're not really
ACID transactions, although they do get Atomicity, Durability and
some form of Consistency and limited Isolation [2]), and we force them
to be written synchronously to leveldb. This basically means that if
the whole batch successfully reaches the disk, everything should be
okay; if the system crashes somewhere in the middle, leveldb will
automatically ignore partial writes, and for a batch of operations
that means the whole batch is ignored.
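The "whole batch or nothing" behaviour described above falls out of how a batch reaches the log: it is written as one checksummed record, and replay discards any record whose checksum doesn't verify. A stdlib-only sketch of that mechanism -- the record format here is invented for illustration and is not leveldb's actual format:

```python
import os
import struct
import zlib

MAGIC = b"BATCH"

def write_batch(log_path, ops, sync=True):
    """Append a batch of (key, value) ops as one CRC-protected record.

    Loosely mirrors how a leveldb write batch reaches its log:
    either the whole record lands on disk, or replay discards it.
    """
    body = b"".join(
        struct.pack(">I", len(k)) + k + struct.pack(">I", len(v)) + v
        for k, v in ops
    )
    record = MAGIC + struct.pack(">II", len(body), zlib.crc32(body)) + body
    with open(log_path, "ab") as f:
        f.write(record)
        f.flush()
        if sync:
            os.fsync(f.fileno())  # the "written synchronously" part

def replay(log_path):
    """Rebuild key/value state, ignoring any torn trailing batch."""
    state = {}
    with open(log_path, "rb") as f:
        data = f.read()
    pos = 0
    while pos + len(MAGIC) + 8 <= len(data):
        if data[pos:pos + len(MAGIC)] != MAGIC:
            break
        length, crc = struct.unpack_from(">II", data, pos + len(MAGIC))
        start = pos + len(MAGIC) + 8
        body = data[start:start + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break  # partial or corrupt batch: drop it entirely
        bpos = 0
        while bpos < len(body):
            klen = struct.unpack_from(">I", body, bpos)[0]; bpos += 4
            k = body[bpos:bpos + klen]; bpos += klen
            vlen = struct.unpack_from(">I", body, bpos)[0]; bpos += 4
            v = body[bpos:bpos + vlen]; bpos += vlen
            state[k] = v
        pos = start + length
    return state
```

Because the checksum covers the whole batch body, a crash mid-batch leaves a record that fails verification, and replay drops every operation in it -- which is exactly the all-or-nothing Atomicity the monitor relies on.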

By the way, I believe this discussion should be moved to ceph-devel,
as it can be beneficial to other members of the community. If no one
objects, I will do so later today.

[1] - http://goo.gl/7HSCH
[2] - Atomicity is internally guaranteed by leveldb, via its own
magic; Consistency as well, guaranteed by ignoring partial writes if
the system should fail, but leveldb mostly leaves it up to the user
wrt how things are flushed to disk; Isolation is fairly limited: you
can only have one process accessing leveldb at a time, but multiple
threads can do as they please, and although leveldb will take care of
most of the required synchronization, some thought should still go
into that; and Durability is said to be configurable, but I have no
idea what that means -- I have yet to find a way to make it not
durable; not that I really need it, but it would be nice to know. :-)


