puzzled with the design pattern of ceph journal, really ruining performance

姚宁 <zay11022@xxxxxxxxx> · Wed, 17 Sep 2014 14:29:22 +0800

Hi, guys

I have been analyzing the architecture of the Ceph source code.

As I understand it, to keep the journal atomic and consistent, the
journal file must either be opened with O_DSYNC or have fdatasync()
called after every write. However, this really hurts performance and
causes high commit latency, even when an SSD is used as the journal
disk. If the SSD has a capacitor that keeps the data safe when the
system crashes, we can mount with the nobarrier option, or the SSD
itself may ignore the FLUSH request, and performance is better in
that case.
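
To make the two write modes above concrete, here is a minimal sketch,
assuming Linux and Python's os module. It is not Ceph's journal code;
RECORD, write_all and the append_* helpers are hypothetical names used
only for illustration. It appends fixed-size records once through an
O_DSYNC descriptor and once with an explicit fdatasync() after each
write, and returns the elapsed time so the per-write sync cost can be
measured on a given device.

```python
import os
import tempfile
import time

RECORD = b"x" * 4096  # hypothetical 4 KiB journal entry


def write_all(fd, data):
    """Write the whole buffer, retrying on short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def append_with_o_dsync(path, count):
    """Mode 1: O_DSYNC, so each write() returns only once the data is durable."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            write_all(fd, RECORD)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def append_with_fdatasync(path, count):
    """Mode 2: plain write() followed by an explicit fdatasync() every time."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            write_all(fd, RECORD)
            os.fdatasync(fd)  # flush file data (not all metadata) to the device
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        t1 = append_with_o_dsync(os.path.join(tmp, "j1"), 50)
        t2 = append_with_fdatasync(os.path.join(tmp, "j2"), 50)
        print(f"O_DSYNC:   {t1:.4f}s for 50 records")
        print(f"fdatasync: {t2:.4f}s for 50 records")
```

The absolute numbers depend heavily on the device and on whether it
honors flush requests, so the sketch prints timings rather than
claiming any particular result.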

So can this be replaced by another strategy?
As far as I can tell, the most important parts are pg_log and
pg_info. They guide a crashed OSD in recovering its objects from its
peers. Therefore, if we can keep pg_log at a consistent point, we can
recover the data without the journal. So could we just use an "undo"
strategy on pg_log and drop the Ceph journal? That would save a lot of
bandwidth, and based on a consistent pg_log epoch we could always
recover the data from the peer OSDs, right? The cost is that more
objects would have to be recovered if the OSD crashes.
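
To illustrate the trade-off in the proposal above, here is a toy
model in Python. It is only a sketch of the idea, not how Ceph's
pg_log or peering actually work; ToyOSD, checkpoint, crash and recover
are hypothetical names. The crashed OSD rolls its log back to the last
version it knows is consistent on disk, then re-fetches from a peer
every object touched after that version.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    objects: dict = field(default_factory=dict)  # object name -> data
    pg_log: list = field(default_factory=list)   # [(version, object name)]
    durable_version: int = 0                     # last consistent log version

    def apply(self, version, name, data):
        """Apply a write without any journal: update object and log in memory."""
        self.objects[name] = data
        self.pg_log.append((version, name))

    def checkpoint(self):
        """Pretend the log up to the current version is now consistent on disk."""
        self.durable_version = self.pg_log[-1][0] if self.pg_log else 0

    def crash(self):
        """Undo: discard log entries newer than the last consistent version."""
        self.pg_log = [(v, n) for v, n in self.pg_log if v <= self.durable_version]


def recover(crashed, peer):
    """Re-fetch every object the peer modified after the consistent version."""
    stale = sorted({n for v, n in peer.pg_log if v > crashed.durable_version})
    for name in stale:
        crashed.objects[name] = peer.objects[name]
    crashed.pg_log = list(peer.pg_log)
    crashed.durable_version = peer.pg_log[-1][0] if peer.pg_log else 0
    return stale


primary, replica = ToyOSD(), ToyOSD()
for version, name in enumerate(["a", "b", "a", "c", "d"], start=1):
    data = f"{name}@v{version}"
    primary.apply(version, name, data)
    replica.apply(version, name, data)
    if version == 2:
        replica.checkpoint()  # only versions 1..2 are known to be consistent

replica.crash()
print(recover(replica, primary))  # ['a', 'c', 'd']
```

In this toy run the replica has to re-fetch three whole objects even
though it had already applied those writes before crashing, which is
the "more objects to recover" cost mentioned above.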

Nicheal