A filesystem like XFS guarantees a single file write, but a Ceph transaction touches file data, xattrs, and leveldb (omap), so there is no way the filesystem can guarantee that transaction as a whole. That's why FileStore implements a write-ahead journal. Basically, it writes the entire transaction object there and trims it from the journal only once it has actually been applied (all operations executed) and persisted in the backend.

Thanks & Regards
Somnath

-----Original Message-----
From: Jan Schermer [mailto:jan@xxxxxxxxxxx]
Sent: Wednesday, October 14, 2015 9:06 AM
To: Somnath Roy
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Ceph journal - isn't it a bit redundant sometimes?

But that's exactly what filesystems and their own journals do already :-)

Jan

> On 14 Oct 2015, at 17:02, Somnath Roy <Somnath.Roy@xxxxxxxxxxx> wrote:
>
> Jan,
> The journal helps FileStore maintain transactional integrity in the event of a crash. That's the main reason.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Jan Schermer
> Sent: Wednesday, October 14, 2015 2:28 AM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Ceph journal - isn't it a bit redundant sometimes?
>
> Hi,
> I've been thinking about this for a while now - does Ceph really need a journal? Filesystems are already pretty good at committing data to disk when asked (and much faster too), and we have external journals in XFS and Ext4...
> In a scenario where a client does an ordinary write, there's no need to flush it anywhere (the app didn't ask for it), so it ends up in pagecache and gets committed eventually.
> If a client asks for the data to be flushed, then fdatasync/fsync on the filestore object takes care of that, including ordering and stuff.
> For reads, you just read from filestore (no need to differentiate between filestore/journal) - pagecache gives you the right version already.
>
> Or is the journal there to achieve some tiering for writes when running spindles with SSDs? This is IMO the only thing ordinary filesystems don't do out of the box even when the filesystem journal is put on SSD - the data gets flushed to the spindle whenever fsync-ed (even with data=journal). But in reality, most of the data will hit the spindle either way, and when you run with SSDs it will always be much slower. And even for tiering - there are already many options (bcache, flashcache or even ZFS L2ARC) that are much more performant and proven stable. I think the fact that people have a need to combine Ceph with stuff like that already proves the point.
>
> So a very interesting scenario would be to disable the Ceph journal and at most use data=journal on ext4. The complexity of the data path would drop significantly, latencies decrease, CPU time is saved...
> I just feel that Ceph has lots of unnecessary complexity inside that duplicates what filesystems (and pagecache...) have been doing for a while now without eating most of our CPU cores - why don't we use that? Is it possible to disable the journal completely?
>
> Did I miss something that makes the journal essential?
>
> Jan
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
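
To illustrate the point made in the reply at the top of this thread (a single transaction touches file data, xattrs, and omap, and the journal entry is trimmed only after everything has been applied), here is a minimal Python sketch of a write-ahead journal. It is not FileStore's actual code: `TinyJournal`, `apply_txn`, `replay`, and `demo` are hypothetical names, the three in-memory dicts merely stand in for file data, xattrs, and omap, and the crash is simulated with an exception.

```python
import copy
import json
import os
import tempfile


class TinyJournal:
    """Hypothetical append-only journal: one JSON-encoded transaction per line."""

    def __init__(self, path):
        self.path = path

    def append(self, txn):
        # Persist the whole transaction before any part of it is applied.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(txn) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def pending(self):
        # Transactions that were journaled but not yet trimmed.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]

    def trim(self):
        # Only call this once the backend has applied and persisted everything.
        with open(self.path, "w", encoding="utf-8") as f:
            f.flush()
            os.fsync(f.fileno())


def apply_txn(backend, txn, crash_after=None):
    # Each op is [kind, object, key, value]; plain assignment keeps ops
    # idempotent, so replaying an already-applied op is harmless.
    for i, (kind, obj, key, value) in enumerate(txn["ops"]):
        if crash_after is not None and i == crash_after:
            raise RuntimeError("simulated crash mid-transaction")
        backend[kind].setdefault(obj, {})[key] = value


def replay(journal, backend):
    # After a restart: re-apply everything still in the journal, then trim.
    for txn in journal.pending():
        apply_txn(backend, txn)
    journal.trim()


def demo():
    # Stand-ins for the three places a single transaction touches.
    backend = {"data": {}, "xattr": {}, "omap": {}}
    with tempfile.TemporaryDirectory() as d:
        journal = TinyJournal(os.path.join(d, "journal"))
        txn = {"ops": [
            ["data", "obj1", "0", "hello"],
            ["xattr", "obj1", "version", "2"],
            ["omap", "obj1", "index", "a"],
        ]}
        journal.append(txn)
        try:
            apply_txn(backend, txn, crash_after=1)  # only the data op lands
        except RuntimeError:
            pass
        partial = copy.deepcopy(backend)  # the inconsistent post-crash state
        replay(journal, backend)          # the journal completes the transaction
        return partial, backend, journal.pending()


if __name__ == "__main__":
    partial, final, pending = demo()
    print("after crash: ", partial)
    print("after replay:", final)
    print("pending:     ", pending)
```

In this sketch the filesystem alone would leave the object in the "after crash" state, with data written but xattr and omap untouched; replaying the journal is what brings all three back in step, assuming the journaled operations are safe to apply more than once.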