How will Ceph cope with a failed Journal device?

Hi,

We have been discussing at my workplace whether to deploy a Ceph cluster in 
production and, if so, how to spec the hardware for it. During that 
discussion I mentioned that, according to the documentation, we should see 
significant speedups from putting the OSDs' journals on dedicated SSDs. 
Unfortunately, my colleagues did not like this idea at all - many of them 
have had bad experiences with failing SSDs, or have at least read a lot 
about such failures on the Internet, and the general consensus among them is 
that SSDs are just not yet reliable enough for production servers.
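For context, the setup we were debating would look roughly like this in 
ceph.conf - a sketch only, with placeholder hostnames and device paths, and 
a journal size taken from the usual rule of thumb rather than anything we 
have actually tested:

    [osd]
        ; journal size in MB
        osd journal size = 10240

    [osd.0]
        host = node1
        ; OSD data stays on the spinning disk; the journal goes to a
        ; partition on a dedicated SSD
        osd journal = /dev/sdb1

So each SSD would carry the journals of one or more OSDs, which is exactly 
why its failure worries my colleagues.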

This leads me to my question: what exactly can happen if an OSD's journal 
device suddenly fails during operation? Can that lead to data loss, 
corruption, or a disruption of the service?

In my experience with the small three-machine test cluster I have here, a 
single failed node usually leads to a fairly severe outage of the entire 
cluster, on the order of ten minutes or more (probably much more when a 
really big node fails), though so far I have seen no data loss or 
corruption...
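For what it's worth, the knobs that seem to govern that timing - as far as I 
understand them, and I may well be wrong - are the OSD heartbeat grace 
period and the monitors' down/out interval. On the test cluster I have been 
experimenting with values like these (just what I tried, not a 
recommendation):

    [osd]
        ; seconds without a heartbeat before peers report an OSD down
        osd heartbeat grace = 20

    [mon]
        ; seconds a down OSD may stay "in" before the monitors mark it
        ; out and recovery/re-replication begins
        mon osd down out interval = 300

Perhaps someone can confirm whether these are actually what determines the 
length of the outage I am seeing.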

Regards,

	Guido