Hi Nicheal,

1. The main purpose of the journal is to provide transaction semantics (i.e. to prevent partial updates). Peering alone is not enough for this, because Ceph writes to all replicas at the same time, so after a crash you have no idea which replica holds the right data. For example, say we have 2 replicas and a user updates a 4 MB object. The primary OSD crashes when only the first 2 MB has been written, and the secondary OSD may also fail when only the first 3 MB has been written. Then the versions on the primary and the secondary are neither the new value nor the old value, and there is no way to recover. So, following the same idea as a database, we need a journal to support transactions and prevent this from happening. For backends that support transactions themselves, BTRFS for instance, we don't need a journal for this; we can write to the journal and the data disk at the same time. The journal there just tries to help performance, since it only does sequential writes and we expect it to be faster than the backend OSD.

2. Do you have any data showing that O_DSYNC or fdatasync kills the performance of the journal? In our previous tests, the journal SSD (using a partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing the same SSD) could reach its peak performance (300-400 MB/s).

Xiaoxi

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Wednesday, September 17, 2014 3:30 PM
To: 姚宁; ceph-devel@xxxxxxxxxxxxxxx
Subject: RE: puzzled with the design pattern of ceph journal, really ruining performance

Hi Nicheal,

It is not only for recovery. IMHO the main purpose of the Ceph journal is to support transaction semantics, since XFS doesn't have that. I guess that can't be achieved with pg_log/pg_info.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of 姚宁
Sent: Tuesday, September 16, 2014 11:29 PM
To: ceph-devel@xxxxxxxxxxxxxxx
Subject: puzzled with the design pattern of ceph journal, really ruining performance

Hi guys,

I have been analyzing the architecture of the Ceph source code.

I know that, in order to keep the journal atomic and consistent, the journal must be opened with O_DSYNC, or fdatasync() must be called after every write operation. However, this kind of operation really kills performance and causes high commit latency, even if an SSD is used as the journal disk. If the SSD has a capacitor to keep data safe when the system crashes, we can set the mount option nobarrier, or the SSD itself will ignore the FLUSH request, so performance would be better.

So can this be replaced by some other strategy? As far as I am concerned, the most important parts are pg_log and pg_info. They guide a crashed OSD in recovering its objects from its peers. Therefore, if we can keep pg_log at a consistent point, we can recover data without the journal. So can we just use an "undo" strategy on pg_log and drop the Ceph journal? That would save a lot of bandwidth, and based on the consistent pg_log epoch we can always recover data from the peering OSDs, right? However, this would mean recovering more objects if the OSD crashes.

Nicheal
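
As an illustration of the transaction argument in the first reply above, here is a minimal write-ahead journal sketch in Python. It is not Ceph code: the record layout and the names `journal_append`, `replay` and `demo` are assumptions made up for this example. The intended update is first appended to a journal and forced to stable storage with fdatasync(); only then is the object overwritten in place. If the in-place write is torn by a crash, replaying the journal brings the object to the new value instead of leaving a mix of old and new data.

```python
import os
import struct
import tempfile
import zlib

# Hypothetical record header: object offset, payload length, CRC32 of payload.
HEADER = struct.Struct("<QII")

# fdatasync() is not available on every platform; fall back to fsync().
_sync = getattr(os, "fdatasync", os.fsync)


def journal_append(journal_path, offset, payload):
    """Append one record to the journal and force it to stable storage."""
    record = HEADER.pack(offset, len(payload), zlib.crc32(payload)) + payload
    fd = os.open(journal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)  # short writes are ignored here for brevity
        _sync(fd)             # commit point: the record is durable after this
    finally:
        os.close(fd)


def replay(journal_path, object_path):
    """Re-apply every complete journal record; stop at a torn record."""
    with open(journal_path, "rb") as j:
        log = j.read()
    applied, pos = 0, 0
    fd = os.open(object_path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        while pos + HEADER.size <= len(log):
            offset, length, crc = HEADER.unpack_from(log, pos)
            start = pos + HEADER.size
            payload = log[start:start + length]
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # incomplete record: that transaction never committed
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, payload)
            applied += 1
            pos = start + length
        _sync(fd)
    finally:
        os.close(fd)
    return applied


def demo(workdir):
    """Simulate a crash halfway through an in-place update, then recover."""
    obj = os.path.join(workdir, "object")
    jrn = os.path.join(workdir, "journal")
    old, new = b"A" * 4096, b"B" * 4096
    with open(obj, "wb") as f:
        f.write(old)
    journal_append(jrn, 0, new)      # 1. journal the full update first
    with open(obj, "r+b") as f:      # 2. "crash" after half of the data write
        f.write(new[:2048])
    with open(obj, "rb") as f:
        torn = f.read()
    replay(jrn, obj)                 # 3. recovery replays the journal
    with open(obj, "rb") as f:
        recovered = f.read()
    return torn, recovered


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        torn, recovered = demo(d)
        print("torn object is a mix of old and new:", torn != recovered)
        print("replay restored the new value:", recovered == b"B" * 4096)
```

The sync call after every journal append is exactly the cost questioned in the original post. In this sketch it is what makes the record a safe commit point, which is why removing it is only reasonable when the device itself guarantees that acknowledged writes survive power loss.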