You are basically listing all the reasons one shouldn't have too much misplacement at once. ;-)

Your best bet is probably pgremapper [1], which I recently learned about on this list. With `cancel-backfill`, you could stop any running backfill. With `undo-upmaps`, you could then specifically start backfilling for those OSDs you want to destroy. The idea of pgremapper seems to be that the balancer will remove the upmaps over time, but since I'm still using the reweight-based balancer, I can't tell you whether it really works that way. But since your misplacement is down to 0 as long as the upmaps are in place, the balancer definitely will do its work of mitigating nearfull OSDs.

And AFAIK, setting the *weight* of a new OSD to 0 should prevent it from causing any rebalancing. However, this is different from reweighting it to 0 (third column, WEIGHT, vs. sixth column, REWEIGHT, in `ceph osd tree`)! Also, I don't see any advantage in setting the weight to 0 over simply not yet creating the OSD.

Best of luck!

[1]: https://github.com/digitalocean/pgremapper

ceph-users@xxxxxxxxxxxxxxxxx wrote:
> Hello all,
>
> I am in the process of adding and removing a number of OSDs in my cluster and I'm running into some issues where it would be good to be able to control the system a bit better. I've tried the documentation and google-fu but have come up short.
>
> This is the background/scenario: I have a cluster that is/was working fine and had HEALTH_OK. I've added a number of new OSDs to the cluster, starting a lot of rebalancing. I also want to remove a number of OSDs from the cluster. Some of these OSDs have been marked out. The cluster has been rebalancing for more than two weeks and is in state HEALTH_WARN.
>
> Inter-related issue 1
> While the cluster is rebalancing, I would like to prioritize migrating PGs from the OSDs that have been marked out. Even though they are marked as out, I can't stop them (down) and remove them (destroy/purge), since they still have remaining PGs.
> For instance, I've had about eight OSDs with between 3 and 7 PGs remaining (ceph osd safe-to-destroy <osd-id>) for over a week. As long as this handful of PGs is there, I can't remove those OSDs. I have set osd_max_backfills, osd_recovery_max_active, osd_recovery_single_start and osd_recovery_sleep on the particular OSDs with no apparent effect, i.e. the PGs are still remaining.
>
> Is there a way to prioritize particular OSDs/PGs for rebalancing?
>
> Inter-related issue 2
> An alternative would be to just destroy the almost empty OSDs anyway, creating recovery activity instead of rebalancing. It doesn't seem like the recovery activity is prioritized over the rebalancing activity.
>
> Is there a way to ensure recovery activities are prioritized over rebalancing activities?
>
> Inter-related issue 3
> I spun up another OSD and marked it as up and out. This caused many additional PGs to become misplaced. Stopping and destroying the new, empty OSD again changed the number of misplaced PGs (returning to the previous amount/percentage).
>
> Can I prevent this by reweighting the OSDs to 0 in addition to marking them as out, or are there any other ways of preventing an OSD that is marked out from impacting the balancing?
>
> Inter-related issue 4
> During rebalancing, several smaller OSDs have become near full. Then one became full (>95%). This changed the cluster from HEALTH_WARN to HEALTH_ERR, stopping client activities. Reweighting the full OSD and the near full OSDs did not change the cluster status. In essence, as far as I have understood it, all the data is there and available, the cluster is in the process of a massive rebalancing, and the PGs on the full OSD were misplaced and supposed to be moved elsewhere (in any case after the manual reweighting), so there should be no reason for the cluster to go to ERR.
> Also, as a consequence of the cluster rebalancing for a long time, the balancer module is prevented from reweighting OSDs, which could have prevented the ERR state (if the reweighting had had an impact). My solution, which required manual intervention, was to mark the full OSD as out. The cluster changed back to HEALTH_WARN, client operations resumed and the rebalancing could continue in the background.
>
> Is there another way to handle a situation like this (an OSD becomes full while having misplaced PGs on it, blocking the cluster)?
>
> Apologies for so many questions in the same email! They are all part of the same management activity for me.
>
> Many thanks!

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx