Re: Ceph durability during outages

2 of 3 monitors are enough to maintain quorum, as is any majority, so
the monitors were not the problem.

However, EC pools have a default min_size of k+1 chunks.
This can be lowered to k, but that has its own dangers.
I assume you are using failure domain = "host"?
Since you had k=6,m=2 and lost 2 failure domains, only k chunks were
left per PG, which is below the default min_size of k+1, so all I/O
stopped.
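If it helps to double-check, something like the following should show
the current min_size and the EC profile in use (the pool and profile
names here are placeholders for yours):

  ceph osd pool get <your-ec-pool> min_size
  ceph osd erasure-code-profile get <your-profile>

Lowering min_size to k ("ceph osd pool set <your-ec-pool> min_size 6")
can bring I/O back during an incident, but while a PG sits at exactly k
chunks any further failure means data loss, so I would treat that as an
emergency measure only.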

Currently, EC pools that still have k chunks available, but fewer than
min_size, do not rebuild on their own.
This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619
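For the post-mortem, something along these lines should list the
affected PGs and, per PG, which OSDs they were waiting for:

  ceph health detail
  ceph pg dump_stuck inactive
  ceph pg <pgid> query     (on one of the incomplete PGs)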

k=6,m=2 is therefore somewhat slim for a 10-host cluster.
I do not use EC myself at the moment, as I have only 3 failure domains,
so others here may know better than I do, but I might have gone with
k=6,m=3. That would allow rebuilding back to HEALTH_OK after 1 host
failure, while remaining available (in HEALTH_WARN) with 2 hosts down.
k=4,m=4 would be very safe, but potentially too expensive.
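Note that k and m cannot be changed on an existing pool, so moving to a
different profile means creating a new pool and migrating the images.
Roughly (names and pg_num are placeholders), that would start with:

  ceph osd erasure-code-profile set ec-6-3 k=6 m=3 crush-failure-domain=host
  ceph osd pool create <new-ec-pool> <pg_num> <pg_num> erasure ec-6-3

Space-wise the raw overhead is (k+m)/k: about 1.33x for 6+2, 1.5x for
6+3 and 2x for 4+4, which is the trade-off to weigh.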


On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
<jp.methot@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I’m running a 3-monitor, 10-OSD-node Ceph cluster in production. This cluster is used to host OpenStack VM RBD images. My pools use a k=6, m=2 erasure code profile with a 3-replica metadata pool in front. The cluster runs well, but we recently had a short outage which triggered unexpected behaviour in the cluster.
>
> I’ve always been under the impression that Ceph would continue working properly even if nodes went down. I tested this several months ago with this configuration and it worked fine as long as no more than 2 nodes went down. However, this time, the first monitor as well as two OSD nodes went down. As a result, OpenStack VMs were able to mount their RBD volumes but unable to read from them, even after the cluster had recovered, with the following message: Reduced data availability: 599 pgs inactive, 599 pgs incomplete.
>
> I believe the cluster should have continued to work properly despite the outage, so what could have prevented it from functioning? Is it because there were only two monitors remaining? Or is it because of that reduced data availability message? In that case, is my erasure coding configuration suitable for that number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com