Re: [SOLVED] Replicated pool with an even size - does min_size have to be bigger than half the size?




Guys,

Ceph does not have a concept of "osd quorum" or "electing a primary
PG". The mons are in a PAXOS quorum, and the mon leader decides which
OSD is primary for each PG. No need to worry about a split OSD brain.
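
To make that concrete, here is a toy sketch in plain Python. It is not
Ceph code: the function name and parameters below are invented purely
for illustration, and it simplifies the real peering protocol. It just
restates the argument that a partitioned half of the cluster can only
keep accepting writes for a PG if it can reach the monitor quorum,
because only that quorum publishes the OSDMap that names the PG's
primary.

# Toy model, NOT Ceph code: the quorum/primary argument as a function.
def partition_can_serve_writes(mons_reachable, total_mons,
                               acting_osds_reachable, min_size):
    """A partition can keep serving writes for a PG only if it reaches a
    majority of the monitors (the single Paxos quorum that publishes the
    OSDMap and so names the one primary OSD for the PG) and still has at
    least min_size OSDs of the PG's acting set."""
    has_mon_quorum = mons_reachable > total_mons // 2
    has_enough_replicas = acting_osds_reachable >= min_size
    return has_mon_quorum and has_enough_replicas

# size=4, min_size=2, 3 monitors; the split leaves 2 OSDs + 2 mons on
# side A and 2 OSDs + 1 mon on side B.
side_a = partition_can_serve_writes(2, 3, acting_osds_reachable=2, min_size=2)
side_b = partition_can_serve_writes(1, 3, acting_osds_reachable=2, min_size=2)
print(side_a, side_b)  # True False: at most one side keeps accepting writes

Since there is exactly one monitor quorum, there is exactly one OSDMap,
and so one primary per PG, at any time; the minority side's PGs go
inactive instead of "electing" a second primary. The min_size
arithmetic the subject line asks about is sketched after the quoted
thread below.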

-- dan



On Thu, Mar 29, 2018 at 2:51 PM, Peter Linder
<peter.linder@xxxxxxxxxxxxxx> wrote:
>
>
> Den 2018-03-29 kl. 14:26, skrev David Rabel:
>
> On 29.03.2018 13:50, Peter Linder wrote:
>
> Den 2018-03-29 kl. 12:29, skrev David Rabel:
>
> On 29.03.2018 12:25, Janne Johansson wrote:
>
> 2018-03-29 11:50 GMT+02:00 David Rabel <rabel@xxxxxxxxxxxxx>:
>
> You are right. But with my above example: If I have min_size 2 and size
> 4, and because of a network issue the 4 OSDs are split into 2 and 2, is
> it possible that I have write operations on both sides and therefore
> have inconsistent data?
>
> You always write to the primary, which in turn sends copies to the 3
> others,
> so in the 2+2 split case, only one side can talk to the primary OSD for
> that pg,
> so writes will just happen on one side at most.
>
> I'm not sure that this is true; won't the side that doesn't have the
> primary simply elect a new one when min_size=2 and there are 2 of
> [failure domain] available? This is assuming that there are enough
> mons available as well.
>
> Even if this is the case, only half of the PGs would be available and
> operations would stop.
>
> Why is this? If min_size is 2 and 2 OSDs are available for a PG,
> operations should not stop. Or am I wrong here?
>
> Yes, but there would be 2 OSDs available per PG on each side of the
> partition, so effectively 2 separate active clusters. If different
> writes were accepted on both sides, it would not be possible to heal
> the cluster later, once the network issue is resolved, because of the
> inconsistency.
>
> Even if it were only a 50/50 chance which side a PG would end up
> active on (going by the original primary), it would mean trouble, as
> many writes could not complete, but I don't think this is the case.
>
> You will also have to take the mon quorum into account, of course;
> that is outside the scope of my post. The best thing, I believe, is to
> have an odd number of everything. I don't know if you can have 4 OSD
> hosts and a 5th node just for quorum; I suppose it would be worth it
> if the extra quorum node could not fail at the same time as 2 of the
> hosts.
>
>
> David
>
>
>
>
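
On the question in the subject line itself, the arithmetic is simple.
The check below is just an illustrative Python snippet, not anything
Ceph exposes: two disjoint groups of OSDs can both reach min_size only
when 2 * min_size <= size.

# Can a pool of 'size' replicas be split into two disjoint groups that
# each still reach 'min_size'?  Only if 2 * min_size <= size.
def split_brain_possible_by_count(size, min_size):
    return 2 * min_size <= size

print(split_brain_possible_by_count(size=4, min_size=2))  # True  (2 + 2)
print(split_brain_possible_by_count(size=4, min_size=3))  # False (3 + 3 > 4)

So min_size does not have to be bigger than half of size to be safe
against divergent writes (the single monitor quorum already guarantees
a single primary per PG), but choosing min_size > size / 2 does
guarantee that at most one side of any split can even reach min_size,
at the cost of losing write availability as soon as two replicas are
down. On the monitor side, an odd count such as 3 or 5 keeps a majority
after losing 1 or 2 mons respectively, which is why the advice above is
to have an odd number of them.
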
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


