Re: Ceph journal - isn't it a bit redundant sometimes?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Mon, 19 Oct 2015 14:15:09 -0700

On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what other people using Ceph think.
>
> If I were to use RADOS directly in my app I'd probably rejoice at its capabilities and how useful and non-legacy it is, but my use is basically for RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities are unneeded.
> I live in this RBD bubble so that's all I know, but isn't this also the only usage pattern that 90% (or more) of people using Ceph care about? Isn't this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of Ceph?
>
> What are the apps that actually use the RADOS features? I know Swift has some RADOS backend (which does the same thing Swift already did by itself, maybe with stronger consistency?), RGW (which basically does the same as Swift?) - doesn't seem either of those would need anything special. What else is there?
> Apps that needed more than POSIX semantics (like databases for transactions) already developed mechanisms to do that - how likely is my database server to replace those mechanisms with the RADOS API and objects in the future? It's all POSIX-filesystem-centric and that's not going away.
>
> Ceph feels like a perfect example of this https://en.wikipedia.org/wiki/Inner-platform_effect
>
> I was really hoping there was an easy way to just get rid of the journal and operate on the filestore directly - that should suffice for anyone using RBD only (in fact until very recently I thought it was possible to just disable the journal in config...)

The biggest thing you're missing here is that Ceph needs to keep *its*
data and metadata consistent. The filesystem journal does *not* let us
do that, so we need a journal of our own.

Could something be constructed to do that more efficiently? Probably,
with enough effort...but it's hard, and we don't have it right now,
and it will still require a Ceph journal, because Ceph will always
have its own metadata that needs to be kept consistent with its data.
(Short example: an RBD client sends two writes. The OSDs crash and
restart, and the client dies before the writes finish. The OSDs try to
reconstruct a consistent view of the data. If they don't have the
right metadata about which writes have been applied, they can't tell
who's got the newest data or whether somebody's missing a piece of it,
and without journaling you could get the second write applied but not
the first, etc.)
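
To make that concrete, here is a toy sketch in Python of the general
write-ahead idea. It is not Ceph's actual code or on-disk format, and
every name in it (ToyOSD, journal_write, apply_next, replay) is made
up for illustration. It only shows why a durable, ordered log plus an
"applied up to here" marker lets a restarted OSD finish the writes in
order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical toy model of a journaled object store (not Ceph code)."""
    journal: list = field(default_factory=list)  # ordered (seq, offset, data)
    store: dict = field(default_factory=dict)    # offset -> data
    applied_seq: int = 0                         # last journal entry applied

    def journal_write(self, offset, data):
        # Step 1: record the write in the journal before touching the store.
        seq = len(self.journal) + 1
        self.journal.append((seq, offset, data))
        return seq

    def apply_next(self):
        # Step 2: apply the oldest not-yet-applied entry to the store.
        seq, offset, data = self.journal[self.applied_seq]
        self.store[offset] = data
        self.applied_seq = seq

    def replay(self):
        # After a restart: apply everything journaled but not yet applied,
        # strictly in journal order.
        while self.applied_seq < len(self.journal):
            self.apply_next()


osd = ToyOSD()
osd.journal_write(0, b"first")
osd.journal_write(4096, b"second")
osd.apply_next()          # only the first write reaches the store...
state_at_crash = dict(osd.store)
# ...then the OSD "crashes". On restart the journal still holds both writes.
osd.replay()
```

After replay() the store holds both writes and applied_seq is 2, which
is also the kind of metadata replicas could compare to decide who has
the newest data. Without the journal, nothing records that a second
write ever existed or in what order the two should land.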

So no, just because RBD is a much simpler case doesn't mean we can
drop our journaling. Sorry, but the world isn't fair.

On Mon, Oct 19, 2015 at 12:18 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> I think if there was a new disk format, we could get away without the
> journal. It seems that Ceph is trying to do extra things because
> regular file systems don't do exactly what is needed. I can understand
> why the developers aren't excited about building and maintaining a new
> disk format, but I think it could be pretty light and highly optimized
> for object storage. I even started thinking through what one might
> look like, but I've never written a file system so I'm probably just
> living in a fantasy land. I still might try...

Well, there used to be one called EBOFS that Sage (mostly) wrote. He
killed it because it had some limits, fixing them was hard, and it had
basically turned into a full filesystem. Now he's trying again with
NewStore, which will hopefully be super awesome and dramatically
reduce stuff like the double writes. But it's a lot harder than you
think; it's been his main dev topic off-and-on for almost a year, and
you can see the thread he just started about it, so.... ;)

Basically what you're both running into is that any consistent system
needs transactions, and providing them is a pain in the butt. Lots of
applications actually don't bother, but a storage system like Ceph
definitely has to.
-Greg