Re: Default min_size value for EC pools

Hi Paul,

maybe we misunderstood each other here, or I'm misunderstanding something. My HA comment was not about PGs becoming active/inactive or about data loss.

As far as I understand the discussions, the OSD flapping itself may be caused by the 2-member HA group, because the OSDs keep marking each other out and marking themselves back in continuously. As far as I have seen, this type of OSD flapping is never observed when only using pools with size>=3 and min_size>size/2 (strictly larger than), because this min_size setting always ensures a stable (>50%) majority of votes among the HA members that cannot be overturned by a single OSD trying to mark itself back in.
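
As a rough illustration (a sketch only; "mypool" is a placeholder name, adjust to your cluster), the 3/2 setup described above would be configured like this on a replicated pool:

    # check current replication settings
    ceph osd pool get mypool size
    ceph osd pool get mypool min_size

    # size >= 3 and min_size > size/2, e.g. 3/2
    ceph osd pool set mypool size 3
    ceph osd pool set mypool min_size 2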

At least, the only context in which I have heard of OSD flapping was in connection with 2/1 pools. I have never seen such a report for, say, 3/2 pools. Am I overlooking something here?

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Paul Emmerich <paul.emmerich@xxxxxxxx>
Sent: 21 May 2019 09:45
To: Frank Schilder
Cc: Florent B; ceph-users
Subject: Re:  Default min_size value for EC pools

No, there is no split-brain problem even with size/min_size 2/1. A PG will not go active if it cannot prove it has the latest data, i.e. if all other OSDs that might have seen writes are currently offline.
That's what the osd_find_best_info_ignore_history_les option effectively does: it tells Ceph to take such a PG active anyway in that situation.

That's why you end up with inactive PGs if you run 2/1 and a disk dies while OSDs flap. You then have to set osd_find_best_info_ignore_history_les if the dead disk is really unrecoverable, losing the latest modifications to the affected objects.
But Ceph will not compromise your data without you manually telling it to do so; it will just block IO instead.
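
For illustration, a sketch of that sequence (assuming the option referred to above is osd_find_best_info_ignore_history_les; verify the exact name and procedure for your Ceph release before using it):

    # first, see which PGs are blocked and why
    ceph health detail
    ceph pg dump_stuck inactive

    # last resort only, if the dead disk is truly unrecoverable:
    ceph tell osd.* injectargs '--osd_find_best_info_ignore_history_les=true'
    # ... let the affected PGs peer and go active, then unset it again:
    ceph tell osd.* injectargs '--osd_find_best_info_ignore_history_les=false'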


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, May 20, 2019 at 10:04 PM Frank Schilder <frans@xxxxxx> wrote:
If min_size=1 and you lose the last disk, that's the end of any data that was only on this disk.

Apart from this, using size=2 and min_size=1 is a really bad idea. This has nothing to do with data replication as such, but rather with an inherent problem of high availability and the number 2: you need at least 3 members of an HA group to ensure stable operation with proper majorities. There are numerous stories about OSD flapping caused by size=2/min_size=1 pools, leading to situations that are extremely hard to recover from. My favourite is this one: https://blog.noc.grnet.gr/2016/10/18/surviving-a-ceph-cluster-outage-the-hard-way/ . You will easily find more. The deeper problem here is called "split brain", and there is no real solution to it except to avoid it at all costs.
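
To audit an existing cluster for this risky combination, something like the following can be used (a sketch; the exact output format differs between releases):

    # list all pools with their size/min_size so any 2/1 pools stand out
    ceph osd pool ls detail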

Best regards,

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Florent B <florent@xxxxxxxxxxx>
Sent: 20 May 2019 21:33
To: Paul Emmerich; Frank Schilder
Cc: ceph-users
Subject: Re:  Default min_size value for EC pools

I understand better thanks to Frank & Paul messages.

Paul, when min_size=k, is it the same problem as with a replicated pool with size=2 & min_size=1?

On 20/05/2019 21:23, Paul Emmerich wrote:
Yeah, the current situation with recovery and min_size is... unfortunate :(

The reason why min_size = k is bad is just that it means you are accepting writes without guaranteeing durability while you are in a degraded state.
A durable storage system should never tell a client "okay, I've written your data" if losing a single disk leads to data loss.

Yes, that is the default behavior of traditional RAID 5 and RAID 6 systems during rebuild (with 1 or 2 disk failures for RAID 5/6 respectively), but that doesn't mean it's a good idea.
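
As a rough sketch of the alternative (profile and pool names are placeholders), an EC pool with k=4, m=2 and min_size=k+1 stops accepting writes one failure before redundancy is exhausted:

    # EC profile with k=4 data and m=2 coding chunks
    ceph osd erasure-code-profile set ec42 k=4 m=2

    # create the pool and make sure min_size is k+1 = 5 rather than k
    ceph osd pool create ecpool 128 128 erasure ec42
    ceph osd pool set ecpool min_size 5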


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



