>> You can do that for a PoC, but that's a bad idea for any production
>> workload. You'd want at least three nodes with OSDs to use the default
>> RF=3 replication. You can do RF=2, but at the peril of your mortal data.
>
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a
single drive is unavailable.

> That isn't even the main risk as I understand it. Of course a double
> failure is going to be a problem with size=2, or traditional RAID1,
> and I think anybody choosing this configuration accepts this risk.

We see people often enough who don't know that. I've seen double
failures. ymmv.

> As I understand it, the reason min_size=1 is a trap has nothing to do
> with double failures per se.

It's one of the concerns.

> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc). So even if the disk is fine, an OSD can go down
> for a short time. If you have size=2, min=1 configured, then when
> this happens the PG will become degraded and will continue operating
> on the other OSD, and the flapping OSD becomes stale. Then when it
> comes back up it recovers. The problem is that if the other OSD has a
> permanent failure (disk crash/etc) while the first OSD is flapping,
> now you have no good OSDs, because when the flapping OSD comes back up
> it is stale, and its PGs have no peer.

Indeed, arguably that's an overlapping failure. I've seen this too, and
have a pg query to demonstrate it.

> I suspect there are ways to re-activate it, though this will result in
> potential data inconsistency since writes were allowed to the cluster
> and will then get rolled back.

Yep.

> With only two OSDs I'm guessing that would be the main impact (well,
> depending on journaling behavior/etc), but if you have more OSDs than
> that then you could have situations where one file is getting rolled
> back, and some other file isn't, and so on.

But you'd have a voting majority.

> With min_size=2 you're fairly safe from flapping because there will
> always be two replicas that have the most recent version of every PG,
> and so you can still tolerate a permanent failure of one of them.

Exactly.

> size=2, min=2 doesn't suffer this failure mode, because anytime there
> is flapping the PG goes inactive and no writes can be made, so when
> the other OSD comes back up there is nothing to recover. Of course
> this results in IO blocks and downtime, which is obviously
> undesirable, but it is likely a more recoverable state than
> inconsistent writes.

Agreed: that's the trade-off between availability and durability.
Depends what's important to you.

> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be a
> trap. This isn't the sort of thing that typically happens in a RAID1
> config, or at least that admins don't think about.

It's both.
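
For anyone following along who wants to check or change these values on
an existing pool, a rough sketch using the stock CLI (the pool name is
just a placeholder, substitute your own):

    # inspect the current replication settings of a pool
    ceph osd pool get <poolname> size
    ceph osd pool get <poolname> min_size

    # move a 2x pool to the defaults discussed above
    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2

Raising size on a pool that already holds data will of course kick off
backfill, so budget the extra capacity and recovery traffic before doing
it on a busy cluster.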