Re: Ceph journal - isn't it a bit redundant sometimes?

Jan Schermer <jan@xxxxxxxxxxx> · Mon, 19 Oct 2015 20:18:51 +0200

I'm sorry for appearing a bit dull (on purpose); I was hoping I'd hear what other people using Ceph think.

If I were to use RADOS directly in my app I'd probably rejoice at its capabilities and how useful and non-legacy it is, but my use is basically for RBD volumes with OpenStack (libvirt, qemu...). And for that those capabilities are unneeded.
I live in this RBD bubble so that's all I know, but isn't this also the only usage pattern that 90% (or more) of the people using Ceph care about? Isn't this what drives Ceph adoption in *Stack? Yet, isn't the biggest PITA when it comes to displacing traditional (DAS, SAN, NAS) solutions the overhead (=complexity) of Ceph?*

What are the apps that actually use the RADOS features? I know Swift has some RADOS backend (which does the same thing Swift already did by itself, maybe with stronger consistency?), RGW (which basically does the same as Swift?) - doesn't seem either of those would need anything special. What else is there?
Apps that needed more than POSIX semantics (like databases for transactions) already developed mechanisms to do that - how likely is my database server to replace those mechanisms with the RADOS API and objects in the future? It's all POSIX-filesystem-centric and that's not going away.

Ceph feels like a perfect example of this: https://en.wikipedia.org/wiki/Inner-platform_effect

I was really hoping there was an easy way to just get rid of the journal and operate on the filestore directly - that should suffice for anyone using RBD only (in fact, until very recently I thought it was possible to just disable the journal in the config...)

Jan

* look at what other solutions do to get better performance - RDMA for example. You can't really get true RDMA performance if you're not touching the drive DMA buffer (or something else very close to the data) over the network directly with minimal latency. That doesn't (IMHO) preclude software-defined storage like Ceph from working over RDMA, but you probably shouldn't try to outsmart the IO patterns...

> On 19 Oct 2015, at 19:44, James (Fei) Liu-SSI <james.liu@xxxxxxxxxxxxxxx> wrote:
> 
> Hi John,
>    Thanks for your explanations.
> 
>    Actually, clients can.  Clients can request fairly complex operations like "read an xattr, stop if it's not there, now write the following discontinuous regions of the file...".  RADOS executes these transactions atomically.
>    [James]  Could you detail a little more about the operations in RADOS transactions?  Is there any limit on the number of ops in one RADOS transaction? What if we came up with similar transaction capabilities, either in a new filesystem or in a key-value store, to map what a RADOS transaction has?  If we can come up with a solution like the one Jan proposed, a 1:1 mapping of transactions onto the filesystem/key-value store, we would not necessarily need journaling in the objectstore, which would dramatically improve the performance of Ceph.
> 
> Thanks.
> 
> Regards,
> James
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of John Spray
> Sent: Monday, October 19, 2015 3:44 AM
> To: Jan Schermer
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Ceph journal - isn't it a bit redundant sometimes?
> 
> On Mon, Oct 19, 2015 at 8:55 AM, Jan Schermer <jan@xxxxxxxxxxx> wrote:
>> I understand this. But the clients can't request something that 
>> doesn't fit a (POSIX) filesystem's capabilities
> 
> Actually, clients can.  Clients can request fairly complex operations like "read an xattr, stop if it's not there, now write the following discontinuous regions of the file...".  RADOS executes these transactions atomically.
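
To make the compound operation described above a bit more concrete, here is a minimal in-memory sketch in Python. It is not the librados API and says nothing about how RADOS implements this internally: ToyObject, require_xattr, write_extent and apply_atomically are hypothetical names invented for illustration, and atomicity is faked by running the steps on a scratch copy. It only shows the shape of "check an xattr, stop if it's missing, otherwise write several discontinuous regions, all or nothing".

```python
import copy


class AbortTransaction(Exception):
    """Raised by a guard step to cancel the whole transaction."""


class ToyObject:
    """Hypothetical in-memory stand-in for a stored object (not a RADOS type)."""

    def __init__(self, data=b"", xattrs=None):
        self.data = bytearray(data)
        self.xattrs = dict(xattrs or {})


def require_xattr(name):
    """Guard step: abort unless the named xattr exists."""
    def step(obj):
        if name not in obj.xattrs:
            raise AbortTransaction(f"xattr {name!r} missing")
    return step


def write_extent(offset, payload):
    """Write step: overwrite one region, zero-extending the object if needed."""
    def step(obj):
        end = offset + len(payload)
        if len(obj.data) < end:
            obj.data.extend(b"\0" * (end - len(obj.data)))
        obj.data[offset:end] = payload
    return step


def apply_atomically(obj, steps):
    """Run every step on a scratch copy; publish the result only if all succeed."""
    scratch = copy.deepcopy(obj)
    try:
        for step in steps:
            step(scratch)
    except AbortTransaction:
        return False  # obj is left exactly as it was
    obj.data, obj.xattrs = scratch.data, scratch.xattrs
    return True


if __name__ == "__main__":
    obj = ToyObject(b"aaaaaaaa", {"user.ver": b"1"})
    ok = apply_atomically(
        obj,
        [require_xattr("user.ver"), write_extent(0, b"XX"), write_extent(6, b"YY")],
    )
    print(ok, bytes(obj.data))
```

A plain POSIX pwrite() sequence has no equivalent of the guard plus several extents committing as one unit, which is the gap a journal (or a backend with native transactions) has to cover.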
> 
> However, you are correct that for many cases (new files, sequential writes) it is possible to avoid the double write of data: the in-development newstore backend does that.  But we still have cases where we do fancier things than the backend (be it POSIX, or a KV store) can handle, so we will have non-fast-path, higher-overhead ways of handling them.
> 
> John
> 
>> That means the requests can map 1:1 into the filestore (O_FSYNC from client == O_FSYNC on the filestore object... ).
>> Pagecache/io-schedulers are already smart enough to merge requests, preserve ordering - they just do the right thing already. It's true that in a distributed environment one async request can map to one OSD and then a synchronous one comes and needs the first one to be flushed beforehand, so that logic is presumably in place already - but I still don't see much need for a journal in there (btw in case of RBD with caching, this logic is probably not even needed at all and merging requests in the RBD cache makes more sense than merging somewhere down the line).
>> It might be faster to merge small writes in journal when the journal is on SSDs and filestore on spinning rust, but it will surely be slower (cpu bound by ceph-osd?) when the filestore is fast enough or when the merging is not optimal.
>> I have never touched anything but a pure SSD cluster, though - I have always been CPU bound and that's why I started thinking about this in the first place. I'd love to have my disks saturated with requests from clients one day.
>> 
>> Don't take this the wrong way, but I've been watching ceph perf talks and stuff and haven't seen anything that would make Ceph comparably fast to an ordinary SAN/NAS.
>> Maybe this is a completely wrong idea, I just think it might be worth thinking about.
>> 
>> Thanks
>> 
>> Jan
>> 
>> 
>>> On 14 Oct 2015, at 20:29, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>> 
>>> A filesystem like XFS guarantees a single file write, but a Ceph transaction touches the file, xattrs, and leveldb (omap), so there is no way the filesystem can guarantee that transaction. That's why FileStore implements a write-ahead journal. Basically, it writes the entire transaction object there and only trims it from the journal once it has actually been applied (all the operations executed) and persisted in the backend.
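
The write-ahead scheme described above can be sketched roughly as follows. This is a toy under stated assumptions, not FileStore's code: ToyJournal, apply_txn, submit and recover are hypothetical names, the "backend" is just three Python dicts standing in for file data, xattrs and omap, operations are idempotent "set key to value" triples, and everything is applied serially (the real thing applies asynchronously and trims incrementally).

```python
import json
import os


class ToyJournal:
    """Hypothetical append-only journal file; one JSON transaction per line."""

    def __init__(self, path):
        self.path = path

    def append(self, txn):
        # Persist the whole transaction before the backend is touched.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(txn) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # Transactions that were journaled but not yet trimmed.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def trim(self):
        # Only call this once everything pending is applied and persisted.
        with open(self.path, "w", encoding="utf-8") as f:
            f.flush()
            os.fsync(f.fileno())


def apply_txn(backend, txn):
    """Apply every op of one transaction; ops are idempotent, so replay is safe."""
    for kind, key, value in txn["ops"]:
        backend[kind][key] = value


def submit(journal, backend, txn):
    """Normal path: journal first, then apply, then trim."""
    journal.append(txn)
    apply_txn(backend, txn)
    journal.trim()


def recover(journal, backend):
    """After a crash: replay whatever is still in the journal, then trim."""
    for txn in journal.pending():
        apply_txn(backend, txn)
    journal.trim()
```

If the process dies between append() and the end of apply_txn(), the data, xattr and omap updates may be partially applied, but recover() replays the complete transaction, so the three stores end up consistent with each other. That cross-store guarantee is what a single-file filesystem journal does not give you.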
>>> 
>>> Thanks & Regards
>>> Somnath
>>> 
>>> -----Original Message-----
>>> From: Jan Schermer [mailto:jan@xxxxxxxxxxx]
>>> Sent: Wednesday, October 14, 2015 9:06 AM
>>> To: Somnath Roy
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re:  Ceph journal - isn't it a bit redundant sometimes?
>>> 
>>> But that's exactly what filesystems and their own journals do already :-)
>>> 
>>> Jan
>>> 
>>>> On 14 Oct 2015, at 17:02, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>>>> 
>>>> Jan,
>>>> The journal helps FileStore maintain transactional integrity in the event of a crash. That's the main reason.
>>>> 
>>>> Thanks & Regards
>>>> Somnath
>>>> 
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
>>>> Sent: Wednesday, October 14, 2015 2:28 AM
>>>> To: ceph-users@xxxxxxxxxxxxxx
>>>> Subject:  Ceph journal - isn't it a bit redundant sometimes?
>>>> 
>>>> Hi,
>>>> I've been thinking about this for a while now - does Ceph really need a journal? Filesystems are already pretty good at committing data to disk when asked (and much faster too), we have external journals in XFS and Ext4...
>>>> In a scenario where a client does an ordinary write, there's no need to flush it anywhere (the app didn't ask for it) so it ends up in pagecache and gets committed eventually.
>>>> If a client asks for the data to be flushed then fdatasync/fsync on the filestore object takes care of that, including ordering and stuff.
>>>> For reads, you just read from filestore (no need to differentiate between filestore/journal) - pagecache gives you the right version already.
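
For reference, the journal-less path proposed above would look roughly like this for a single file. This is only a sketch of the idea, assuming Linux and one plain file per object; write_object and read_object are hypothetical helper names, not anything in Ceph. Note that it covers only data in one file, not xattrs or omap.

```python
import os

# fdatasync is not available on every platform; fall back to fsync if missing.
_datasync = getattr(os, "fdatasync", os.fsync)


def write_object(path, offset, payload, durable=False):
    """Write payload at offset; only flush to stable storage if asked to."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, payload, offset)   # lands in the page cache
        if durable:
            _datasync(fd)                # client asked for a flush
    finally:
        os.close(fd)


def read_object(path, offset, length):
    """Read straight from the file; the page cache returns the latest version."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

A non-durable write returns as soon as the data is in the page cache, and a later read sees it whether or not it has reached the disk yet.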
>>>> 
>>>> Or is the journal there to achieve some tiering for writes when running spindles with SSDs? This is IMO the only thing ordinary filesystems don't do out of the box, even when the filesystem journal is put on an SSD - the data get flushed to the spindle whenever fsync-ed (even with data=journal). But in reality, most of the data will hit the spindle either way and when you run with SSDs it will always be much slower. And even for tiering - there are already many options (bcache, flashcache or even ZFS L2ARC) that are much more performant and proven stable. I think the fact that people have a need to combine Ceph with stuff like that already proves the point.
>>>> 
>>>> So a very interesting scenario would be to disable Ceph journal and at most use data=journal on ext4. The complexity of the data path would drop significantly, latencies decrease, CPU time is saved...
>>>> I just feel that Ceph has lots of unnecessary complexity inside that duplicates what filesystems (and pagecache...) have been doing for a while now without eating most of our CPU cores - why don't we use that? Is it possible to disable journal completely?
>>>> 
>>>> Did I miss something that makes journal essential?
>>>> 
>>>> Jan
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
>> 