Re: Ceph durability during outages

It is inherently dangerous to accept client IO - particularly writes -
when only k chunks are available, just as it is dangerous to accept IO
with a single replica in replicated mode. It is not inherently dangerous
to do recovery at k, but apparently the recovery path was originally
written to gate on min_size rather than k.
Looking at the PR, the actual code change is fairly small, ~30 lines,
but it is a fairly critical change and has several pages of test code
associated with it. It also requires explicitly setting
"osd_allow_recovery_below_min_size" as a safeguard, so it is clearly
being treated with caution.
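
Once that lands (and assuming the option keeps that name), turning it on
should just be a matter of flipping the flag in the config database,
roughly:

    ceph config set osd osd_allow_recovery_below_min_size true

Note that this only permits recovery while a PG is below min_size but
still has k chunks; it does not make the pool accept client IO at k,
which is exactly the distinction the PR is trying to preserve.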


On Wed, Jul 24, 2019 at 2:28 PM Jean-Philippe Méthot
<jp.methot@xxxxxxxxxxxxxxxxx> wrote:
>
> Thank you, that does make sense. I was completely unaware that min size was k+1 and not k. Had I known that, I would have designed this pool differently.
>
> Regarding that feature for Octopus, I’m guessing it shouldn't be dangerous for data integrity to recover at less than min_size?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
> On 24 Jul 2019, at 13:49, Nathan Fish <lordcirth@xxxxxxxxx> wrote:
>
> 2/3 monitors is enough to maintain quorum, as is any majority.
>
> However, EC pools have a default min_size of k+1 chunks.
> This can be adjusted to k, but that has its own dangers.
> I assume you are using failure domain = "host"?
> Since you had k=6,m=2 and lost 2 failure domains, you were left with
> exactly k chunks, so all IO stopped.
>
> Currently, EC pools that still have at least k chunks, but fewer than min_size, do not rebuild.
> This is being worked on for Octopus: https://github.com/ceph/ceph/pull/17619
>
> k=6,m=2 is therefore somewhat slim for a 10-host cluster.
> I do not currently use EC myself, as I have only 3 failure domains, so
> others here may know better, but I might have gone with k=6,m=3. That
> would allow the cluster to rebuild back to HEALTH_OK after 1 host
> failure, and to remain available in HEALTH_WARN with 2 hosts down.
> k=4,m=4 would be very safe, but potentially too expensive.
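>
> For comparison, the raw-space overhead of a profile is simply (k+m)/k:
>
>     k=6,m=2  ->  8/6 ~ 1.33x raw per usable byte, survives 2 lost chunks
>     k=6,m=3  ->  9/6 = 1.50x, survives 3 lost chunks
>     k=4,m=4  ->  8/4 = 2.00x, survives 4 lost chunks
>
> so k=6,m=3 buys one extra failure domain of headroom for about 12.5%
> more raw space than your current profile.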
>
>
> On Wed, Jul 24, 2019 at 1:31 PM Jean-Philippe Méthot
> <jp.methot@xxxxxxxxxxxxxxxxx> wrote:
>
>
> Hi,
>
> I’m running a Ceph cluster in production with 3 monitors and 10 OSD nodes. The cluster hosts RBD volumes for OpenStack VMs. My pools use a k=6, m=2 erasure code profile, with a 3-replica metadata pool in front. The cluster runs well, but we recently had a short outage which triggered unexpected behaviour in the cluster.
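>
> (For reference, the pools were set up roughly along these lines - the
> names and pg_num values below are placeholders rather than the exact
> commands:
>
>     ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=host
>     ceph osd pool create vms-data 1024 1024 erasure ec62
>     ceph osd pool set vms-data allow_ec_overwrites true
>     ceph osd pool create vms-metadata 128 128 replicated
>
> with the RBD images created with --data-pool pointing at the EC pool,
> so the replicated pool holds the image metadata and the EC pool holds
> the data.)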
>
> I’ve always been under the impression that Ceph would continue working properly even if nodes went down. I tested it several months ago with this configuration and it worked fine as long as only 2 nodes went down. However, this time the first monitor as well as two OSD nodes went down. As a result, OpenStack VMs were able to mount their RBD volumes but unable to read from them, even after the cluster had recovered, with the following message: Reduced data availability: 599 pgs inactive, 599 pgs incomplete.
>
> I believe the cluster should have continued to work properly despite the outage, so what could have prevented it from functioning? Is it because there were only two monitors remaining? Or is it that reduced data availability message? In that case, is my erasure coding configuration suitable for this number of nodes?
>
> Jean-Philippe Méthot
> Openstack system administrator
> Administrateur système Openstack
> PlanetHoster inc.
>
>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



