Disclaimer: I'm fairly new to Ceph, but I've read a bunch of threads on the min_size=1 issue, as it perplexed me when I started: a single replica is generally considered fine in many other applications. However, there really are some concerns unique to Ceph beyond just the number of disks you can lose...

On Fri, Dec 22, 2023 at 3:09 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > 2 x "storage nodes"
>
> You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data.

I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security. Some may consider RAID1 inappropriate, and if so then size=2 is the same, but many admins are quite comfortable with it. The issue is that if you lose a disk, your PGs become inactive - the data is perfectly safe, but you have downtime. That is probably not what you're expecting, since it isn't what happens with RAID1. Read on...

> > If just one data node goes down we will still not lose any
> > data and that is fine until we have a new server.
>
> ... unless one of the drives in the surviving node fails.

That isn't even the main risk as I understand it. Of course a double failure is going to be a problem with size=2, or with traditional RAID1, and anybody choosing this configuration accepts that risk.

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se. The issue is that Ceph OSDs are somewhat prone to flapping during recovery (OOM, etc.), so even if the disk is fine, an OSD can go down for a short time. If you have size=2, min_size=1 configured, then when this happens the PG becomes degraded and continues operating on the other OSD, while the flapping OSD becomes stale. When it comes back up, it recovers.
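To make the degraded/inactive distinction concrete, here's a toy model of the rule as I understand it (not Ceph code, just an illustration): a PG serves I/O only while the number of up replicas is at least min_size.

```python
# Toy model of the PG availability rule (illustration only, not Ceph
# internals): a placement group accepts I/O only while the number of
# currently-up replicas is at least min_size.

def pg_state(size: int, min_size: int, up_osds: int) -> str:
    """Rough PG state given `up_osds` surviving replicas."""
    if up_osds == 0:
        return "down"             # no copies reachable
    if up_osds < min_size:
        return "inactive"         # data safe, but client I/O blocks
    if up_osds < size:
        return "active+degraded"  # serving I/O on fewer replicas
    return "active+clean"

# size=2, min_size=2: losing one disk blocks I/O (no data loss)
print(pg_state(2, 2, 1))  # inactive
# size=2, min_size=1: losing one disk keeps accepting writes
print(pg_state(2, 1, 1))  # active+degraded
```

With min_size=2 the one-disk failure is downtime; with min_size=1 it is silently degraded writes, which is where the trouble below starts.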
The problem is that if the other OSD suffers a permanent failure (disk crash, etc.) while the first OSD is flapping, you now have no good OSDs: when the flapping OSD comes back up it is stale, and its PGs have no peer to recover from. I suspect there are ways to re-activate it, though this results in potential data inconsistency, since writes were acknowledged to clients and will then get rolled back. With only two OSDs I'm guessing that would be the main impact (depending on journaling behavior, etc.), but with more OSDs you could have situations where one file is getting rolled back while another isn't, and so on.

With min_size=2 you're fairly safe from flapping, because there will always be two replicas holding the most recent version of every PG, so you can still tolerate a permanent failure of one of them. size=2, min_size=2 doesn't suffer this failure mode either, because any time there is flapping the PG goes inactive and no writes can be made, so when the other OSD comes back up there is nothing to recover. Of course this means blocked IO and downtime, which is obviously undesirable, but it is a far more recoverable state than inconsistent writes.

Apologies if I've gotten any of that wrong, but my understanding is that it is these sorts of failure modes that make min_size=1 a trap. This isn't the sort of thing that typically happens in a RAID1 config, or at least it isn't something admins think about. Those implementations are simpler and less prone to flapping, though to its credit I'm guessing Ceph would be far better at detecting this sort of thing in the first place.

> Ceph is a scale-out solution, not meant for a very small number of servers. For replication, you really want at least 3 nodes with OSDs and size=3, min_size=2. More nodes is better. If you need a smaller-scale solution, DRBD or ZFS might be better choices.
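The flap-then-failure sequence I described can be sketched as a toy simulation (again, this models my mental picture of the failure mode, not Ceph's actual peering logic; OSD and write semantics are simplified):

```python
# Toy simulation of why min_size=1 can lose acknowledged writes under
# flapping. Illustration of the failure mode only, not Ceph internals.

class OSD:
    def __init__(self):
        self.data = []   # writes this replica has committed
        self.up = True

def write(replicas, min_size, value):
    """Write is acknowledged only if at least min_size replicas are up."""
    up = [o for o in replicas if o.up]
    if len(up) < min_size:
        return False          # PG inactive: client I/O blocks
    for o in up:
        o.data.append(value)  # only up replicas receive the write
    return True

# Scenario: size=2 pool. OSD A flaps, then OSD B fails permanently.
a, b = OSD(), OSD()
write([a, b], 1, "w1")          # both replicas hold w1

a.up = False                    # A flaps (OOM during recovery, etc.)
acked = write([a, b], 1, "w2")  # min_size=1: w2 is acked on B alone

b.up = False                    # B's disk then fails permanently
a.up = True                     # A returns, but it is stale

# The acknowledged write w2 lived only on B and is now gone:
print(acked, "w2" in a.data)    # True False

# With min_size=2, the flap alone would have blocked the write instead:
c, d = OSD(), OSD()
write([c, d], 2, "w1")
c.up = False
print(write([c, d], 2, "w2"))   # False: downtime, but no lost ack
```

The key line is the `True False` output: the client was told w2 was durable, and the only surviving replica never saw it.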
Agree that pretty much all distributed filesystems suffer performance issues with a small number of nodes, especially on hard disks. MooseFS is a decent distributed solution on small clusters, but it lacks high availability in the FOSS version. I've found that on small clusters of hard disks it actually performs much better than CephFS, but it certainly won't scale up nearly as well. With only 2-3 hard disks it will still perform fairly poorly either way. Of course ZFS will perform best of all, but it lacks any kind of host-level redundancy.

--
Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx