Re: Ceph journal - isn't it a bit redundant sometimes?

Jan Schermer <jan@xxxxxxxxxxx> · Mon, 19 Oct 2015 23:45:51 +0200

> On 19 Oct 2015, at 23:15, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> 
> On Mon, Oct 19, 2015 at 11:18 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> I'm sorry for appearing a bit dull (on purpose), I was hoping I'd hear what other people using Ceph think.
>> 
>> If I were to use RADOS directly in my app I'd probably rejoice at its capabilities and how useful and non-legacy it is, but my use is basically for RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities are unneeded.
>> I live in this RBD bubble so that's all I know, but isn't this also the only usage pattern that 90% (or more) of people using Ceph care about? Isn't this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of Ceph?
>> 
>> What are the apps that actually use the RADOS features? I know Swift has some RADOS backend (which does the same thing Swift already did by itself, maybe with stronger consistency?), RGW (which basically does the same as Swift?) - doesn't seem either of those would need anything special. What else is there?
>> Apps that needed more than POSIX semantics (like databases for transactions) already developed mechanisms to do that - how likely is my database server to replace those mechanisms with RADOS API and objects in the future? It's all posix-filesystem-centric and that's not going away.
>> 
>> Ceph feels like a perfect example of this https://en.wikipedia.org/wiki/Inner-platform_effect
>> 
>> I was really hoping there was an easy way to just get rid of the journal and operate on the filestore directly - that should suffice for anyone using RBD only (in fact, until very recently I thought it was possible to just disable the journal in the config...)
> 
> The biggest thing you're missing here is that Ceph needs to keep *its*
> data and metadata consistent. The filesystem journal does *not* let us
> do that, so we need a journal of our own.
> 

I get that, but I can't see any reason for the client IO to cause any change in this data.
Rebalancing? Maybe OK if it needs this state data. Changing CRUSH? OK, probably a good idea to have several copies that are checksummed and versioned and put somewhere super-safe.
But I see no need for client IO to pass through here, ever...

> Could something be constructed to do that more efficiently? Probably,
> with enough effort...but it's hard, and we don't have it right now,
> and it will still require a Ceph journal, because Ceph will always
> have its own metadata that needs to be kept consistent with its data.
> (Short example: rbd client sends two writes. OSDs crash and restart.
> client dies before they finish. OSDs try to reconstruct consistent
> view of the data. If OSDs don't have the right metadata about which
> writes have been applied, they can't tell who's got the newest data or
> if somebody's missing some piece of it, and without journaling you
> could get the second write applied but not the first, etc)
> 
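
Greg's example above can be illustrated with a toy write-ahead journal. This is a generic sketch, not Ceph's FileStore journal code: the class ToyOSD, its methods and the integer version numbers are all hypothetical, and the journal list is simply assumed to be durable.

```python
class ToyOSD:
    """Hypothetical OSD: every write is journaled before it touches the store."""

    def __init__(self):
        self.journal = []   # assumed durable, ordered list of (version, block, data)
        self.blocks = {}    # backing store; may be only partially updated at crash time

    def journal_write(self, version, block, data):
        # Record the intent first; applying it to self.blocks happens later.
        self.journal.append((version, block, data))

    def replay(self):
        # After a restart, re-apply every journaled write in its original order.
        for _version, block, data in self.journal:
            self.blocks[block] = data

    def last_version(self):
        # Lets peers compare who has seen the newest write.
        return self.journal[-1][0] if self.journal else 0


# OSD "a" journaled both writes, but only the second reached the store before the crash.
a = ToyOSD()
a.journal_write(1, 0, b"first")
a.journal_write(2, 1, b"second")
a.blocks[1] = b"second"

# OSD "b" only ever saw the first write.
b = ToyOSD()
b.journal_write(1, 0, b"first")

a.replay()
b.replay()
```

After replay, "a" holds both writes in order, and comparing last_version() tells the OSDs that "b" is behind. Without the journal, neither fact could be recovered from the store contents alone.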

If the writes were followed by a flush (barrier) then that blocks until the data (all data not flushed) is safe and durable on the disk. Whether that means in a journal or flushed to OSD filesystem makes no difference.
If the writes were not followed by a flush then anything can happen - there could be any state (like only the second write happening) and that's what the client _MUST_ be able to cope with, Ceph or not. It's the same as a physical drive - will it have the data or not after a crash? Who cares - the OS didn't get a confirmation so it's replayed (from filesystem journal in the guest, database transaction log, retried from application...).
Even if just the first write happened and then the whole cluster went down - no different than a power failure with a local disk.
I can't see a scenario where something breaks - RBD is a block device, not a filesystem. The filesystem on top already has a journal and a better understanding of what needs to be durable or not.
Until the guest VM asks for data to be durable, any state is acceptable.
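
The flush argument can be made concrete with a toy model. This is only a sketch under stated assumptions: ToyDisk is a hypothetical class, and it assumes that after a crash any subset of the unflushed writes may have reached the media, which is the guarantee a drive with a volatile write cache gives.

```python
import itertools


class ToyDisk:
    """Hypothetical block device with a volatile write cache."""

    def __init__(self):
        self.durable = {}   # block -> data guaranteed to survive a crash
        self.pending = []   # unflushed (block, data) writes, in submission order

    def write(self, block, data):
        self.pending.append((block, data))

    def flush(self):
        # The barrier: returns only once every pending write is durable.
        for block, data in self.pending:
            self.durable[block] = data
        self.pending.clear()

    def possible_states_after_crash(self):
        # Assumption: any subset of the pending writes may have hit the media.
        states = []
        for r in range(len(self.pending) + 1):
            for subset in itertools.combinations(self.pending, r):
                state = dict(self.durable)
                for block, data in subset:
                    state[block] = data
                states.append(state)
        return states


disk = ToyDisk()
disk.write(0, "A")
disk.flush()
disk.write(0, "B")   # first unflushed write
disk.write(1, "C")   # second unflushed write
before_flush = disk.possible_states_after_crash()
disk.flush()
after_flush = disk.possible_states_after_crash()
```

Before the second flush there are four legal outcomes, including the one where only the second write landed. After the flush there is exactly one.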

You are right that without a "Ceph transaction log" it has no idea what was written and what wasn't - does that matter? It does not :-)
If a guest makes a write to an RBD image in a 3-replica cluster and all 3 OSDs involved lose power at the same moment, what can it expect?
Did the guest get a confirmation for the write or not?
If it did then all replicas are consistent at that one moment.
If it did not then there might be different objects on those 3 OSDs - so what? The guest doesn't care what data is there because no disk gives that guarantee. All Ceph needs to do is stick to one version (possibly by a simple timestamp) and replicate it. Even if it was not the "best" copy, the guest filesystem must and will cope with that.
You're trying to bring consistency to something that doesn't really need it by design. Even if you dutifully preserve every single IO the guest did - if it didn't get that confirmation then it will not use the data anyway - it will retry the transaction and you're just wasting cycles.
HDDs don't have a journal, do they? And it works, that's what filesystems are for :)
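
To make the "stick to one version and replicate it" idea concrete, here is a minimal sketch. It assumes each replica carries a modification timestamp for the object; Replica and converge are hypothetical names, and this is an illustration of the proposal, not a description of how Ceph recovers.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    osd: str       # hypothetical OSD name
    mtime: float   # assumed per-object modification timestamp
    data: bytes


def converge(replicas):
    """Pick the copy with the newest timestamp and overwrite the others with it."""
    # The OSD name breaks ties so the choice is deterministic.
    winner = max(replicas, key=lambda r: (r.mtime, r.osd))
    winner_mtime, winner_data = winner.mtime, winner.data
    for r in replicas:
        r.mtime, r.data = winner_mtime, winner_data
    return winner


replicas = [
    Replica("osd.0", 10.0, b"old"),
    Replica("osd.1", 12.0, b"new"),
    Replica("osd.2", 11.0, b"mid"),
]
chosen = converge(replicas)
```

Whichever copy wins, all three replicas end up identical, which is all an unacknowledged write would need under this argument.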

And if you remember my suggestion from a few months ago about durability when georeplicating - with that you get durability equal to any SAN/NAS, and you need a journal even less.... :)

Jan

> So no, just because RBD is a much simpler case doesn't mean we can
> drop our journaling. Sorry, but the world isn't fair.
> 
> On Mon, Oct 19, 2015 at 12:18 PM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA256
>> 
>> I think if there was a new disk format, we could get away without the
>> journal. It seems that Ceph is trying to do extra things because
>> regular file systems don't do exactly what is needed. I can understand
>> why the developers aren't excited about building and maintaining a new
>> disk format, but I think it could be pretty light and highly optimized
>> for object storage. I even started thinking through what one might
>> look like, but I've never written a file system so I'm probably just
>> living in a fantasy land. I still might try...
> 
> Well, there used to be one called EBOFS that Sage (mostly) wrote. He
> killed it because it had some limits, fixing them was hard, and it had
> basically turned into a full filesystem. Now he's trying again with
> NewStore, which will hopefully be super awesome and dramatically
> reduce stuff like the double writes. But it's a lot harder than you
> think; it's been his main dev topic off-and-on for almost a year, and
> you can see the thread he just started about it, so.... ;)
> 
> Basically what you're both running into is that any consistent system
> needs transactions, and providing them is a pain in the butt. Lots of
> applications actually don't bother, but a storage system like Ceph
> definitely does.
> -Greg
