Thanks for all the replies, folks. I think it's a testament to the versatility of Ceph that there are some differences of opinion and experience here.

With regards to the purpose of this cluster, it provides distributed storage for stateful container workloads. The data produced is somewhat immutable and can be regenerated over time, although regenerating it does slow down the teams that use it as part of their development pipeline. To the best of my understanding, the goals were to provide a data-loss safety net while still making efficient use of the block devices assigned to the cluster, which I imagine is where the EC direction came from. The cluster is 3 nodes, with the OSDs mainly housed in two of them. Additionally, there was an initiative to 'use what we have' (or, as I like to put it, 'cobble it together') with commodity hardware that was immediately to hand. The departure of my predecessor has left some unanswered questions, so I am not going to bother second-guessing beyond what I already know.

As I understand it, my steps are:

1. Move off the data and scrap the cluster as it stands currently (already under way).
2. Group the block devices into pools of the same geometry and type (and maybe do some tiering?).
3. Spread the OSDs across all 3 nodes so recovery scope isn't so easily compromised by a loss at the bare-metal level (rough sketch of 2 and 3 below).
4. Add more hosts/OSDs if EC is the right solution (this may be outside the scope of this implementation, but I'll keep a-cobblin'!).

The additional ceph outputs follow:

ceph osd tree <https://termbin.com/vq63>
ceph osd erasure-code-profile get cephfs-media-ec <https://termbin.com/h33h>

I am fully prepared to do away with EC to keep things simple and efficient in terms of CPU occupancy.
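For steps 2 and 3, the rough shape I have in mind is below. This is only a sketch: the rule and pool names and the PG counts are placeholders, and it assumes the rebuilt OSDs report their device classes (hdd/ssd) correctly.

  # Replicated CRUSH rules split by device class, with 'host' as the
  # failure domain so a single node loss can't take out every copy of a PG:
  ceph osd crush rule create-replicated rep-hdd default host hdd
  ceph osd crush rule create-replicated rep-ssd default host ssd

  # Pools grouped by device type; 3 copies spread over the 3 nodes:
  ceph osd pool create container-hdd 64 64 replicated rep-hdd
  ceph osd pool set container-hdd size 3
  ceph osd pool set container-hdd min_size 2

With size 3 and min_size 2, a pool keeps serving I/O with one node down but stops once a PG is down to a single copy, which is the safety net I'm after. It obviously depends on step 3 being done first, since three host-separated copies need OSDs on all three nodes.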
On Mon, 16 Nov 2020 at 02:32, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:

> On Mon, 16 Nov 2020 at 10:54, Hans van den Bogert <hansbogert@xxxxxxxxx>
> wrote:
>
> > > With this profile you can only lose one OSD at a time, which is really
> > > not that redundant.
> >
> > That's rather situation dependent. I don't have really large disks, so
> > the repair time isn't that long. Further, my SLO isn't so high that I
> > need 99.xxx% uptime; if 2 disks break in the same repair window, that
> > would be unfortunate, but I'd just grab a backup from a mirroring
> > cluster. Looking at it from another perspective: I came from a
> > single-host RAID5 scenario, and I'd argue this is better, since I can
> > survive a host failure.
> >
> > Also, this is a sliding problem, right? Someone with K+3 could argue
> > K+2 is not enough as well.
>
> There are a few situations, like when you are moving data or when a scrub
> has found a bad PG, where you are suddenly out of copies if something bad
> happens. I think RAID5 operators also found this out: when your cold
> spare disk kicks in, you find that old undetected error on one of the
> other disks and learn that rebuilds are risky or stress your raid too
> much.
>
> As with raids, the cheapest resource is often the actual disks rather
> than operator time, restore wait times and so on, which is why many on
> this list advocate for K+2-or-more, or repl=3: we have seen the errors
> one normally didn't expect. Yes, a double surprise of two disks failing
> in the same night after running for years is uncommon, but it is not as
> uncommon to resize pools, move PGs around or find a scrub error or two
> some day.
>
> So while one could always say "one more drive is better than your
> amount", there are people losing data with repl=2 or K+1 because some
> more normal operation was in flight and _then_ a single surprise
> happened. You can have a weird reboot, leaving those PGs needing
> backfill later, and if one of the up-to-date hosts has any single
> surprise during the recovery, the cluster will lack some of the current
> data even though two disks were never down at the same time.
>
> Drive manufacturers print Mean Time Between Failures; storage admins
> count Mean Time Between Surprises.
>
> --
> May the most significant bit of your life be positive.
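If EC does stay in the picture once there are more hosts, the advice above points at two coding chunks with 'host' as the failure domain. Purely as a sketch (the profile and pool names and the k/m split are placeholder values, not anything agreed in this thread):

  # k=4, m=2 needs at least 6 hosts with crush-failure-domain=host
  # (k=2, m=2 would need at least 4):
  ceph osd erasure-code-profile set ec-k4-m2 k=4 m=2 crush-failure-domain=host
  ceph osd pool create media-ec 64 64 erasure ec-k4-m2

  # min_size of k+1, so the pool stops I/O once it is down to the bare k
  # shards rather than running with no margin during recovery:
  ceph osd pool set media-ec min_size 5

On the current 3 nodes the same profile would only map with crush-failure-domain=osd, and then a single host failure can take out multiple shards of a PG, which is exactly the trap I'm trying to climb out of.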