>> You can do that for a PoC, but that's a bad idea for any production
>> workload. You'd want at least three nodes with OSDs to use the default
>> RF=3 replication. You can do RF=2, but at the peril of your mortal data.
>
> I'm not sure I agree - I think size=2, min_size=2 is no worse than
> RAID1 for data security.

size=2, min_size=2 *is* RAID1. Except that you become unavailable if a
single drive is unavailable.

> That isn't even the main risk as I understand it. Of course a double
> failure is going to be a problem with size=2, or traditional RAID1,
> and I think anybody choosing this configuration accepts this risk.

We see people often enough who don't know that. I've seen double
failures. ymmv.

> As I understand it, the reason min_size=1 is a trap has nothing to do
> with double failures per se.

It's one of the concerns.

> The issue is that Ceph OSDs are somewhat prone to flapping during
> recovery (OOM, etc). So even if the disk is fine, an OSD can go down
> for a short time. If you have size=2, min=1 configured, then when
> this happens the PG will become degraded and will continue operating
> on the other OSD, and the flapping OSD becomes stale. Then when it
> comes back up it recovers. The problem is that if the other OSD has a
> permanent failure (disk crash/etc) while the first OSD is flapping,
> now you have no good OSDs, because when the flapping OSD comes back up
> it is stale, and its PGs have no peer.

Indeed, arguably that's an overlapping failure. I've seen this too, and
have a pg query to demonstrate it.

> I suspect there are ways to re-activate it, though this will result in
> potential data inconsistency since writes were allowed to the cluster
> and will then get rolled back.

Yep.

> With only two OSDs I'm guessing that would be the main impact (well,
> depending on journaling behavior/etc), but if you have more OSDs than
> that then you could have situations where one file is getting rolled
> back, and some other file isn't, and so on.

But you'd have a voting majority.

> With min_size=2 you're fairly safe from flapping because there will
> always be two replicas that have the most recent version of every PG,
> and so you can still tolerate a permanent failure of one of them.

Exactly.

> size=2, min=2 doesn't suffer this failure mode, because anytime there
> is flapping the PG goes inactive and no writes can be made, so when
> the other OSD comes back up there is nothing to recover. Of course
> this results in IO blocks and downtime, which is obviously
> undesirable, but it is likely a more recoverable state than
> inconsistent writes.

Agreed: that's the trade-off between availability and durability.
Depends what's important to you.

> Apologies if I've gotten any of that wrong, but my understanding is
> that it is these sorts of failure modes that cause min_size=1 to be a
> trap. This isn't the sort of thing that typically happens in a RAID1
> config, or at least that admins don't think about.

It's both.
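
For anyone following along who wants to check or change these values on
an existing pool, a rough sketch using the stock CLI (the pool name is
just a placeholder, substitute your own):

    # inspect the current replication settings of a pool
    ceph osd pool get <poolname> size
    ceph osd pool get <poolname> min_size

    # move a 2x pool to the defaults discussed above
    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2

Raising size on a pool that already holds data will of course kick off
backfill, so budget the extra capacity and recovery traffic before doing
it on a busy cluster.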