> Many of us deploy ceph as a solution to storage high-availability.
> Over time, I've encountered a couple of moments when ceph refused to
> deliver I/O to VMs even when only a tiny part of the PGs was stuck in
> non-active states due to problems on the OSDs.

I do not know what you mean by this; you can tune this with your pool
size and min_size (see the first sketch at the end of this mail). It is
hard to believe that several hard drives fail in exactly the same PG. I
wonder if this is not more related to your 'non-default' config?

> So I found myself in very unpleasant situations when an entire
> cluster went down because of a single node, even though that cluster
> was supposed to be fault-tolerant.

That is also very hard to believe, since I update ceph and reboot one
node at a time, and that just goes fine (see the second sketch at the
end of this mail).

> Regardless of the reason, the cluster itself can be a single point of
> failure, even if it has a lot of nodes.

Indeed, like the data center, and like the planet. The question you
should ask yourself is: do you have a better alternative? In the 3-4
years I have been using ceph, I have not found a better alternative
(I am also not looking for one ;)).

> How do you segment your deployments so that your business doesn't get
> jeopardised when your ceph cluster misbehaves?
>
> Does anyone even use ceph for very large clusters, or do you prefer to
> separate everything into smaller clusters?

If you would read and investigate, you would not need to ask this
question. Is your lack of knowledge of ceph maybe the critical issue? I
know the ceph organization likes to make everything as simple as
possible for everyone, but that of course has its flip side when users
run into serious issues.
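To make the size/min_size remark concrete, here is a minimal sketch of
the relevant commands. The pool name 'rbd' is only an example here,
substitute your own pool:

    # Check the current replication settings of a pool.
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # With size=3 and min_size=2, a PG keeps serving I/O with one copy
    # down; I/O only pauses once two of the three copies are gone.
    ceph osd pool set rbd size 3
    ceph osd pool set rbd min_size 2

    # List PGs that are stuck in non-active states, which is the
    # situation you describe.
    ceph pg dump_stuck inactive

Setting min_size=1 would keep I/O going from a single remaining copy,
but it risks data loss, which is exactly why it is not the default.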
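And to show what I mean by rebooting one node at a time, a rough sketch
of the usual rolling maintenance procedure:

    # Stop the cluster from rebalancing while a node is down for
    # maintenance.
    ceph osd set noout

    # Upgrade and reboot ONE node, then wait until the cluster reports
    # healthy again (only the noout warning should remain) before
    # touching the next node.
    ceph status
    ceph health detail

    # When the last node is done, allow normal recovery again.
    ceph osd unset noout

Done this way, a single node going down should never take the whole
cluster with it, as long as your size/min_size settings are sane.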