On Fri, Aug 4, 2023 at 11:33 AM Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>
> Dave,
>
> Actually, my failure domain is OSD since I so far only have 9 OSD nodes
> but EC 8+2. However, the drives are still functioning, except that one
> has failed multiple times in the last few days, requiring a node
> power-cycle to recover. I will certainly mark that one out immediately.
>
> The other two pending failures are behaving more politely, so I am
> assuming that the cluster could copy the data elsewhere as part of the
> rebalance. I think I'm also concerned about the rebalance process moving
> data to these drives with pending failures.
>
> Since I'm EC 8+2, perhaps it is safe to mark two out simultaneously?

Dave,

You should be able to mark out two OSDs simultaneously without worry, as
long as you have enough space, etc. When you mark an OSD out, it still
participates in the cluster as long as it remains up, and it can aid in
the backfilling process. You'll therefore want to avoid stopping/downing
the OSDs until backfilling completes; if you stop both OSDs before
backfilling completes, you will put yourself in a bad spot.

If all PGs are active+clean, you may both a) out the two OSDs and b)
stop/down *only the one* imminently failing OSD (leaving the second OSD
being drained still up), and things should also be fine... but you will
be vulnerable to blocked ops/unavailable data if _subsequent_ OSDs fail
unexpectedly, including the second OSD being drained, depending upon your
CRUSH map and cluster status.

Note that if your intent is to purge the OSD after it is drained, I
believe you should do a `ceph osd crush reweight osd.X 0` rather than a
`ceph osd out osd.X` or `ceph osd reweight osd.X 0`, as it should result
in slightly less net data movement.

Cheers,
Tyler
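
P.S. For what it's worth, here is a rough sketch of the sequence described
above. The IDs (osd.X, osd.Y) are placeholders for your two failing OSDs,
and the systemctl line assumes a package-based/systemd deployment rather
than cephadm, so adjust for your environment:

    # Drain both OSDs by zeroing their CRUSH weight (slightly less net
    # data movement than `ceph osd out` if they will be purged afterwards)
    ceph osd crush reweight osd.X 0
    ceph osd crush reweight osd.Y 0

    # Stop only the imminently failing OSD; leave the other one up so it
    # can keep participating in backfill
    systemctl stop ceph-osd@X      # run on the node hosting osd.X

    # Watch progress until all PGs are back to active+clean
    ceph -s
    ceph pg stat

    # Only once everything is active+clean, remove the drained OSDs
    ceph osd purge osd.X --yes-i-really-mean-it
    ceph osd purge osd.Y --yes-i-really-mean-it

On a cephadm-managed cluster the stop step would instead be something like
`ceph orch daemon stop osd.X`.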