I posted the same question to the list last week and never got a reply. I'd also like to know whether there's a difference in failure behavior between XFS-backed Ceph (write-ahead journaling) and BTRFS-backed Ceph (parallel journaling).

Calvin

On Fri, May 18, 2012 at 12:30 PM, Guido Winkelmann <guido-ceph@xxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> We have been having a lot of discussions at my workplace about whether to
> deploy a Ceph cluster in production and, if so, how to set up the hardware
> for it. During that discussion, I mentioned that, according to the
> documentation, we should see significant speedups from using dedicated SSDs
> for the OSDs' journals. Unfortunately, my colleagues did not like this idea
> at all: many of them have had bad experiences with SSDs failing, or have at
> least read a lot about such failures, and the general consensus among them
> is that SSDs are just not quite reliable enough yet for production servers.
>
> This leads me to the question: what exactly can happen if an OSD's journal
> device suddenly fails during operation? Can that lead to data loss,
> corruption, or disruption of the service?
>
> In my experience with the small three-machine test cluster I have here, a
> single failed node would usually lead to a pretty severe outage of the
> entire cluster, on the order of ten minutes or more (probably much more
> when a really big node fails), though so far there has been no data loss
> or corruption.
>
> Regards,
>
> Guido
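
For reference, the dedicated-SSD journal setup Guido describes is configured per OSD in ceph.conf. A minimal sketch, assuming a hypothetical SSD partition /dev/sdb1 dedicated to osd.0's journal (the device path, host name, and journal size here are illustrative, not taken from the thread):

    [osd]
            ; journal size in MB; 1 GB is only an illustrative value
            osd journal size = 1024

    [osd.0]
            host = node1
            ; point the journal at a dedicated SSD partition instead of a
            ; file on the OSD's data disk (hypothetical device path)
            osd journal = /dev/sdb1

With a layout like this, losing the SSD takes down every OSD journaling to it, so the blast radius of one SSD failure depends on how many OSD journals share the device.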
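As for the roughly ten-minute outage window: the length of the disruption is governed largely by the cluster's failure-detection and re-replication timers rather than by the journal itself. A minimal sketch of the relevant ceph.conf knobs, assuming the commonly cited defaults of that era (values are assumptions, not taken from the thread):

    [osd]
            ; seconds without heartbeats before peers report an OSD as down
            osd heartbeat grace = 20

    [mon]
            ; seconds a down OSD may stay "in" before the monitors mark it
            ; out and the cluster starts re-replicating its data
            mon osd down out interval = 300

Lowering these values shortens the window during which I/O to the affected placement groups stalls after a node or journal failure, at the cost of triggering rebalancing on shorter, possibly transient, outages.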