----- On 27 Feb 2025, at 8:28, Frédéric Nass frederic.nass@xxxxxxxxxxxxxxxx wrote:

> ----- On 26 Feb 2025, at 16:40, Tim Holloway timh@xxxxxxxxxxxxx wrote:
>
>> Thanks. I did resolve that problem, though I haven't had a chance to
>> update until now.
>>
>> I had already attempted to use ceph orch to remove the daemons, but
>> they didn't succeed.
>>
>> Fortunately, I was able to bring the host online, which allowed the
>> scheduled removals to complete. I confirmed everything was drained,
>> removed the host from inventory again, and powered it down.
>>
>> Still got complaints from cephadm about the decommissioned host.
>>
>> I took a break - impatience and ceph don't mix - and came back to
>> address the next problem, which was lots of stuck PGs. Either because
>> cephadm timed out or because something kicked in when I started randomly
>> rebooting OSDs, the host complaint finally disappeared. End of story.
>>
>> Now for what sent me down that path.
>>
>> I had 2 OSDs on one server and felt that that was probably not a good
>> idea, so I marked one for deletion. 4 days later it was still in
>> "destroying" state. More concerning, all signs indicated that despite
>> having been reweighted to 0, the "destroying" OSD was still an
>> essential participant, with no indication that its PGs were being
>> relocated to active servers. Shutting down the "destroying" OSD would
>> immediately trigger a re-allocation panic, but that didn't clean
>> anything. The re-allocation would proceed at a furious pace, then
>> slowly stall out and hang, and the system was degraded. Restarting the
>> OSD brought the PG inventory back up, but stuff still wasn't moving off
>> the OSD.
>>
>> Right about that time I decommissioned the questionable host.
>>
>> Finally, I did a "ceph orch rm osd.x" and terminated the "destroying"
>> OSD permanently, making it finally disappear from the OSD tree list.
>>
>> I also deleted a number of OSD pools that are (hopefully) not going to
>> be missed.
>>
>> Kicking and repeatedly rebooting the other OSDs at random finally
>> cleared all the stuck PGs, some of which hadn't resolved in over 2
>> days.
>>
>> So at the moment, it's either rebalancing the cleaned-up OSDs or in a
>> loop thinking that it is.
>
> Since you deleted some pools, it's probably the upmap balancer rebalancing PGs
> across the OSDs.
>
>> And the PG-per-OSD count seems way too high,
>
> How high is it right now? On what hardware?
>
>> but the autoscaler doesn't seem to want to do anything about that.
>
> If the PG autoscaler is enabled, you could try adjusting the per-pool settings [1]
> and see if the number of PGs decreases.
> If it's disabled, you could manually reduce the number of PGs on the remaining
> pools to lower the PG/OSD ratio.
>
> Regards,
> Frédéric.

[1] https://docs.ceph.com/en/latest/rados/operations/placement-groups/

Regards,
Frédéric.

>
>> Of course, the whole shebang has been unavailable to clients this whole
>> week because of that.
>>
>> I've been considering upgrading to Reef, but recent posts regarding
>> issues resembling what I've been going through are making me pause.
>>
>> Again, thanks!
>> Tim
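To put numbers on the PG/OSD ratio and the autoscaler behaviour discussed above, the commands would look something like this (the pool name 'mypool' and the pg_num value are placeholders only; adjust them to your own pools and hardware):

  ceph osd df tree                                  # PG count and usage per OSD
  ceph balancer status                              # is the upmap balancer still moving PGs?
  ceph osd pool autoscale-status                    # what the autoscaler thinks each pool should have
  ceph osd pool set mypool pg_autoscale_mode on     # let the autoscaler resize this pool
  ceph osd pool set mypool pg_num 64                # or set pg_num by hand if the autoscaler is off

A commonly cited rule of thumb is roughly 100 PGs per OSD, and pg_num reductions are applied gradually through PG merging, so the count will take a while to converge.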
>>
>> On Wed, 2025-02-26 at 13:57 +0100, Frédéric Nass wrote:
>>> Hi Tim,
>>>
>>> If you can't bring the host back online so that cephadm can remove
>>> these services itself, I guess you'll have to clean up the mess by:
>>>
>>> - removing these services from the cluster (for example with a 'ceph
>>> mon remove {mon-id}' for the monitor)
>>> - forcing their removal from the orchestrator with the --force option
>>> on the commands 'ceph orch daemon rm <names>' and 'ceph orch host rm
>>> <hostname>'. If the --force option doesn't help, then looking
>>> into/editing/removing config-keys like 'mgr/cephadm/inventory'
>>> and 'mgr/cephadm/host.ceph07.internal.mousetech.com' that the 'ceph
>>> config-key dump' output shows might help.
>>>
>>> Regards,
>>> Frédéric.
>>>
>>> ----- On 25 Feb 2025, at 16:42, Tim Holloway timh@xxxxxxxxxxxxx wrote:
>>>
>>> > Ack. Another fine mess.
>>> >
>>> > I was trying to clean things up, and the process of tossing around
>>> > OSDs kept getting me reports of slow responses and hanging PG
>>> > operations.
>>> >
>>> > This is Ceph Pacific, by the way.
>>> >
>>> > I found a deprecated server that claimed to have an OSD even though
>>> > it didn't show in either "ceph osd tree" or the dashboard OSD list.
>>> > I suspect that a lot of the grief came from it attempting to use
>>> > resources that weren't always seen as resources.
>>> >
>>> > I shut down the server's OSD (removed the daemon using ceph orch),
>>> > then foolishly deleted the server from the inventory without doing
>>> > a drain first.
>>> >
>>> > Now cephadm hates me (key not found), and there are still an MDS
>>> > and a MON listed by 'ceph orch ls' even after I powered the host
>>> > off.
>>> >
>>> > I cannot do a ceph orch daemon delete because there's no longer an
>>> > IP address available for the daemon delete, and I cannot clear the
>>> > cephadm queue:
>>> >
>>> > [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed:
>>> > 'ceph07.internal.mousetech.com'
>>> >
>>> > Any suggestions?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
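P.S. Spelled out as commands, the cleanup suggested earlier in the thread would look roughly like this (the mon id, hostname and config-key names are taken from this thread and are only illustrative; treat it as an untested sketch and check the 'ceph config-key dump' output carefully before removing any keys):

  ceph mon remove ceph07                                              # drop the dead host's monitor from the monmap
  ceph orch ps ceph07.internal.mousetech.com                          # list the daemon names still recorded on the dead host
  ceph orch daemon rm <daemon-name> --force                           # force-remove each leftover daemon
  ceph orch host rm ceph07.internal.mousetech.com --force             # force-remove the host itself
  ceph config-key dump | grep ceph07                                  # look for stale cephadm keys if the above fails
  ceph config-key rm mgr/cephadm/host.ceph07.internal.mousetech.com   # remove the stale host entry
  ceph mgr fail                                                       # fail over the active mgr so cephadm re-reads its state

The mgr failover at the end is there because the cephadm module keeps its host and daemon inventory cached in the active mgr.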