Re: Ceph journal - isn't it a bit redundant sometimes?

Josh Durgin <jdurgin@xxxxxxxxxx> · Mon, 19 Oct 2015 16:43:54 -0700

On 10/19/2015 02:45 PM, Jan Schermer wrote:

On 19 Oct 2015, at 23:15, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
I'm sorry for appearing a bit dull (on purpose); I was hoping to hear what other people using Ceph think.

If I were to use RADOS directly in my app I'd probably rejoice at its capabilities and how useful and non-legacy it is, but my use is basically for RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities are unneeded.
I live in this RBD bubble so that's all I know, but isn't this also the only usage pattern that 90% (or more) of people using Ceph care about? Isn't this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of Ceph?

What are the apps that actually use the RADOS features? I know Swift has some RADOS backend (which does the same thing Swift already did by itself, maybe with stronger consistency?), RGW (which basically does the same as Swift?) - doesn't seem either of those would need anything special. What else is there?
Apps that needed more than POSIX semantics (like databases for transactions) already developed mechanisms to do that - how likely is my database server to replace those mechanisms with the RADOS API and objects in the future? It's all POSIX-filesystem-centric and that's not going away.

Ceph feels like a perfect example of this https://en.wikipedia.org/wiki/Inner-platform_effect

I was really hoping there was an easy way to just get rid of the journal and operate on filestore directly - that should suffice for anyone using RBD only (in fact until very recently I thought it was possible to just disable the journal in the config...)

The biggest thing you're missing here is that Ceph needs to keep *its*
data and metadata consistent. The filesystem journal does *not* let us
do that, so we need a journal of our own.

I get that, but I can't see any reason for the client IO to cause any change in this data.
Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, probably a good idea to have several copies that are checksummed and versioned and put somewhere super-safe.
But I see no need for client IO to pass through here, ever...

Could something be constructed to do that more efficiently? Probably,
with enough effort...but it's hard, and we don't have it right now,
and it will still require a Ceph journal, because Ceph will always
have its own metadata that needs to be kept consistent with its data.
(Short example: rbd client sends two writes. OSDs crash and restart.
client dies before they finish. OSDs try to reconstruct consistent
view of the data. If OSDs don't have the right metadata about which
writes have been applied, they can't tell who's got the newest data or
if somebody's missing some piece of it, and without journaling you
could get the second write applied but not the first, etc)
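To make that example concrete, here is a minimal toy sketch of the idea. It is not how the FileStore journal or the PG log is actually implemented; `ToyOSD`, `journal_write`, `apply_pending` and `recover` are hypothetical names invented for illustration. Each replica records a versioned entry before touching its store, so after a crash the replicas can compare versions, work out who has the newest data, and replay whatever is missing in order.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical replica: a journal of (version, offset, data) plus a store."""
    journal: list = field(default_factory=list)
    store: dict = field(default_factory=dict)
    applied: int = 0  # highest version already applied to the store

    def journal_write(self, version, offset, data):
        # Record the intent first; the store is not touched yet.
        self.journal.append((version, offset, data))

    def apply_pending(self):
        # Apply journaled entries in order, skipping ones already applied.
        for version, offset, data in self.journal:
            if version > self.applied:
                self.store[offset] = data
                self.applied = version

    def last_version(self):
        return self.journal[-1][0] if self.journal else 0


def recover(osds):
    """After a restart, bring every replica up to the newest journaled version."""
    newest = max(osds, key=lambda o: o.last_version())
    for osd in osds:
        have = osd.last_version()
        # Copy over the entries this replica never saw, then replay in order.
        missing = [e for e in newest.journal if e[0] > have]
        osd.journal.extend(missing)
        osd.apply_pending()
    return newest.last_version()
```

For instance, if one replica journaled writes 1 and 2, another only write 1, and a third neither before the crash, `recover` leaves all three with both writes applied in order. Without the versioned entries there would be nothing to compare.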

If the writes were followed by a flush (barrier) then that blocks until the data (all data not flushed) is safe and durable on the disk. Whether that means in a journal or flushed to OSD filesystem makes no difference.
If the writes were not followed by a flush then anything can happen - there could be any state (like only the second write happening) and that's what the client _MUST_ be able to cope with, Ceph or not. It's the same as a physical drive - will it have the data or not after a crash? Who cares - the OS didn't get a confirmation so it's replayed (from filesystem journal in the guest, database transaction log, retried from application...).
Even if just the first write happened and then the whole cluster went down - no different than a power failure with a local disk.
I can't see a scenario where something breaks - RBD is a block device, not a filesystem. The filesystem on top already has a journal and a better understanding of what needs to be durable or not.
Until the guest VM asks for data to be durable, any state is acceptable.
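The flush semantics described here can be sketched with a toy model. This is a simplification under stated assumptions, not a model of any real drive or of RBD: `ToyDisk` and its methods are hypothetical, and the "any subset may survive" behaviour on a crash is simulated with a coin flip.

```python
import random


class ToyDisk:
    """Hypothetical block device with a volatile write cache."""

    def __init__(self):
        self.durable = {}  # blocks that are safely on media
        self.cache = {}    # acknowledged-to-cache but unflushed writes

    def write(self, lba, data):
        # A plain write only lands in the volatile cache.
        self.cache[lba] = data

    def flush(self):
        # A flush (barrier) returns only once everything cached is durable.
        self.durable.update(self.cache)
        self.cache.clear()

    def crash(self, rng):
        # On power loss any subset of unflushed writes may have reached media.
        for lba, data in self.cache.items():
            if rng.random() < 0.5:
                self.durable[lba] = data
        self.cache.clear()


disk = ToyDisk()
disk.write(0, "a")
disk.flush()                     # block 0 is now guaranteed
disk.write(1, "b")
disk.write(2, "c")               # no flush: either, both or neither may survive
disk.crash(random.Random())
```

After the crash only block 0 is guaranteed to be present; blocks 1 and 2 can be there in any combination, including the second without the first.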

You are right that without a "Ceph transaction log" it has no idea what was written and what wasn't - does that matter? It does not :-)
If a guest makes a write to an RBD image in a 3-replica cluster and power on all 3 OSDs involved goes down at the same moment, what can it expect?
Did the guest get a confirmation for the write or not?
If it did then all replicas are consistent at that one moment.
If it did not then there might be different objects on those 3 OSDs - so what? The guest doesn't care what data is there because no disk gives that guarantee. All Ceph needs to do is stick to one version (by a simple timestamp possibly) and replicate it. Even if it was not the "best" copy, the guest filesystem must and will cope with that.
You're trying to bring consistency to something that doesn't really need it by design. Even if you dutifully preserve every single IO the guest did - if it didn't get that confirmation then it will not use the data anyway - it will retry the transaction and you're just wasting cycles.
HDDs don't have a journal, do they? And it works, that's what filesystems are for :)

This works for local HDDs because guests know when they need to replay
operations - powercycling a machine restarts the guest and the HDD.
With Ceph and other distributed storage systems, a storage node going
down is not visible to a guest, so the filesystem or DB or app in the
guest has no idea that it needs to replay its journal.
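A toy sketch of that difference, under loud assumptions: `LossyStore`, `Guest` and `stale_blocks` are hypothetical names, the store is assumed to drop every unflushed write on a crash, and the guest's cache stands in for whatever in-memory state (page cache, DB buffers) it holds about what is on disk.

```python
class LossyStore:
    """Hypothetical backing store that loses unflushed writes on a crash."""

    def __init__(self):
        self.durable, self.pending = {}, {}

    def write(self, lba, data):
        self.pending[lba] = data

    def crash(self):
        self.pending.clear()  # worst case: nothing unflushed survives

    def read(self, lba):
        return self.pending.get(lba, self.durable.get(lba))


class Guest:
    def __init__(self, store):
        self.store = store
        self.cache = {}  # what the guest believes is on disk

    def write(self, lba, data):
        self.cache[lba] = data
        self.store.write(lba, data)

    def power_cycle(self):
        # Local HDD case: guest and disk go down together, so the guest
        # loses its in-memory view and knows to replay its journal on boot.
        self.store.crash()
        self.cache.clear()


def stale_blocks(guest):
    """Blocks where the guest's view disagrees with what the store returns."""
    return [lba for lba, data in guest.cache.items()
            if guest.store.read(lba) != data]


local = Guest(LossyStore())
local.write(7, "x")
local.power_cycle()      # guest restarts along with its disk

remote = Guest(LossyStore())
remote.write(7, "x")
remote.store.crash()     # storage node dies; guest keeps running, unaware
```

In the local case `stale_blocks(local)` is empty because the guest restarted with the disk. In the remote case `stale_blocks(remote)` reports block 7: the guest still believes a write that the store no longer has, and nothing tells it to replay.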

Josh

And if you remember my suggestion (from a few months ago) about durability when georeplicating - with that you get durability equal to any SAN/NAS, and you need a journal even less.... :)

Jan

So no, just because RBD is a much simpler case doesn't mean we can
drop our journaling. Sorry, but the world isn't fair.

On Mon, Oct 19, 2015 at 12:18 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

I think if there was a new disk format, we could get away without the
journal. It seems that Ceph is trying to do extra things because
regular file systems don't do exactly what is needed. I can understand
why the developers aren't excited about building and maintaining a new
disk format, but I think it could be pretty light and highly optimized
for object storage. I even started thinking through what one might
look like, but I've never written a file system so I'm probably just
living in a fantasy land. I still might try...

Well, there used to be one called EBOFS that Sage (mostly) wrote. He
killed it because it had some limits, fixing them was hard, and it had
basically turned into a full filesystem. Now he's trying again with
NewStore, which will hopefully be super awesome and dramatically
reduce stuff like the double writes. But it's a lot harder than you
think; it's been his main dev topic off-and-on for almost a year, and
you can see the thread he just started about it, so.... ;)

Basically what you're both running into is that any consistent system
needs transactions, and providing them is a pain in the butt. Lots of
applications actually don't bother, but a storage system like Ceph
definitely does.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com