Disclaimer: I'm fairly new to Ceph, but I've read a bunch of threads on the min_size=1 issue, as it perplexed me when I started: a single replica is generally considered fine in many other applications. However, there really are some concerns unique to Ceph beyond just the number of disks you can lose...

On Fri, Dec 22, 2023 at 3:09 PM Anthony D'Atri <aad@xxxxxxxxxxxxxx> wrote:
>
> > 2 x "storage nodes"
>
> You can do that for a PoC, but that's a bad idea for any production workload. You'd want at least three nodes with OSDs to use the default RF=3 replication. You can do RF=2, but at the peril of your mortal data.

I'm not sure I agree - I think size=2, min_size=2 is no worse than RAID1 for data security. Some may consider RAID1 inappropriate, and if so then size=2 is the same, but many admins are quite comfortable with it. The issue is that if you lose a disk, your PGs become inactive - the data is perfectly safe, but you have downtime. That is probably not what you're expecting, since it isn't what happens with RAID1. Read on...

> > If just one data node goes down we will still not lose any
> > data and that is fine until we have a new server.
>
> ... unless one of the drives in the surviving node fails.

That isn't even the main risk as I understand it. Of course a double failure is going to be a problem with size=2, or with traditional RAID1, and anybody choosing this configuration accepts that risk.

As I understand it, the reason min_size=1 is a trap has nothing to do with double failures per se. The issue is that Ceph OSDs are somewhat prone to flapping during recovery (OOM, etc.), so even if the disk is fine, an OSD can go down for a short time. If you have size=2, min_size=1 configured, then when this happens the PG becomes degraded and continues operating on the other OSD, while the flapping OSD becomes stale. When it comes back up, it recovers.
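To make the degraded/inactive distinction concrete, here's a toy model of the rule as I understand it (not Ceph code, just an illustration): a PG serves I/O only while the number of up replicas is at least min_size.

```python
# Toy model of the PG availability rule (illustration only, not Ceph
# internals): a placement group accepts I/O only while the number of
# currently-up replicas is at least min_size.

def pg_state(size: int, min_size: int, up_osds: int) -> str:
    """Rough PG state given `up_osds` surviving replicas."""
    if up_osds == 0:
        return "down"             # no copies reachable
    if up_osds < min_size:
        return "inactive"         # data safe, but client I/O blocks
    if up_osds < size:
        return "active+degraded"  # serving I/O on fewer replicas
    return "active+clean"

# size=2, min_size=2: losing one disk blocks I/O (no data loss)
print(pg_state(2, 2, 1))  # inactive
# size=2, min_size=1: losing one disk keeps accepting writes
print(pg_state(2, 1, 1))  # active+degraded
```

With min_size=2 the one-disk failure is downtime; with min_size=1 it is silently degraded writes, which is where the trouble below starts.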
The problem is that if the other OSD suffers a permanent failure (disk crash, etc.) while the first OSD is flapping, you now have no good OSDs: when the flapping OSD comes back up it is stale, and its PGs have no peer to recover from. I suspect there are ways to re-activate it, though this results in potential data inconsistency, since writes were acknowledged to clients and will then get rolled back. With only two OSDs I'm guessing that would be the main impact (depending on journaling behavior, etc.), but with more OSDs you could have situations where one file is getting rolled back while another isn't, and so on.

With min_size=2 you're fairly safe from flapping, because there will always be two replicas holding the most recent version of every PG, so you can still tolerate a permanent failure of one of them. size=2, min_size=2 doesn't suffer this failure mode either, because any time there is flapping the PG goes inactive and no writes can be made, so when the other OSD comes back up there is nothing to recover. Of course this means blocked IO and downtime, which is obviously undesirable, but it is a far more recoverable state than inconsistent writes.

Apologies if I've gotten any of that wrong, but my understanding is that it is these sorts of failure modes that make min_size=1 a trap. This isn't the sort of thing that typically happens in a RAID1 config, or at least it isn't something admins think about. Those implementations are simpler and less prone to flapping, though to its credit I'm guessing Ceph would be far better at detecting this sort of thing in the first place.

> Ceph is a scale-out solution, not meant for a very small number of servers. For replication, you really want at least 3 nodes with OSDs and size=3, min_size=2. More nodes is better. If you need a smaller-scale solution, DRBD or ZFS might be better choices.
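The flap-then-failure sequence I described can be sketched as a toy simulation (again, this models my mental picture of the failure mode, not Ceph's actual peering logic; OSD and write semantics are simplified):

```python
# Toy simulation of why min_size=1 can lose acknowledged writes under
# flapping. Illustration of the failure mode only, not Ceph internals.

class OSD:
    def __init__(self):
        self.data = []   # writes this replica has committed
        self.up = True

def write(replicas, min_size, value):
    """Write is acknowledged only if at least min_size replicas are up."""
    up = [o for o in replicas if o.up]
    if len(up) < min_size:
        return False          # PG inactive: client I/O blocks
    for o in up:
        o.data.append(value)  # only up replicas receive the write
    return True

# Scenario: size=2 pool. OSD A flaps, then OSD B fails permanently.
a, b = OSD(), OSD()
write([a, b], 1, "w1")          # both replicas hold w1

a.up = False                    # A flaps (OOM during recovery, etc.)
acked = write([a, b], 1, "w2")  # min_size=1: w2 is acked on B alone

b.up = False                    # B's disk then fails permanently
a.up = True                     # A returns, but it is stale

# The acknowledged write w2 lived only on B and is now gone:
print(acked, "w2" in a.data)    # True False

# With min_size=2, the flap alone would have blocked the write instead:
c, d = OSD(), OSD()
write([c, d], 2, "w1")
c.up = False
print(write([c, d], 2, "w2"))   # False: downtime, but no lost ack
```

The key line is the `True False` output: the client was told w2 was durable, and the only surviving replica never saw it.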
Agree that pretty much all distributed filesystems suffer performance issues with a small number of nodes, especially on hard disks. MooseFS is a decent distributed solution on small clusters, but it lacks high availability in the FOSS version. I've found that on small clusters of hard disks it actually performs much better than CephFS, but it certainly won't scale up nearly as well. With only 2-3 hard disks it will still perform fairly poorly either way. Of course ZFS will perform best of all, but it lacks any kind of host-level redundancy.

--
Rich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx