----- On Feb 26, 2025, at 16:40, Tim Holloway timh@xxxxxxxxxxxxx wrote:

> Thanks. I did resolve that problem, though I haven't had a chance to update until now.
>
> I had already attempted to use ceph orch to remove the daemons, but the removals didn't succeed.
>
> Fortunately, I was able to bring the host online, which allowed the scheduled removals to complete. I confirmed everything was drained, removed the host from the inventory again, and powered it down.
>
> Still got complaints from cephadm about the decommissioned host.
>
> I took a break - impatience and ceph don't mix - and came back to address the next problem, which was lots of stuck PGs. Either because cephadm timed out or because something kicked in when I started randomly rebooting OSDs, the host complaint finally disappeared. End of story.
>
> Now for what sent me down that path.
>
> I had 2 OSDs on one server and felt that that was probably not a good idea, so I marked one for deletion. 4 days later it was still in "destroying" state. More concerning, all signs indicated that despite having been reweighted to 0, the "destroying" OSD was still an essential participant, with no indication that its PGs were being relocated to active servers. Shutting down the "destroying" OSD would immediately trigger a re-allocation panic, but that didn't clean anything up. The re-allocation would proceed at a furious pace, then slowly stall out and hang, leaving the system degraded. Restarting the OSD brought the PG inventory back up, but stuff still wasn't moving off the OSD.
>
> Right about that time I decommissioned the questionable host.
>
> Finally, I did a "ceph orch rm osd.x", which terminated the "destroying" OSD permanently, making it finally disappear from the OSD tree list.
>
> I also deleted a number of OSD pools that are (hopefully) not going to be missed.
>
> Kicking and repeatedly rebooting the other OSDs at random finally cleared all the stuck PGs, some of which hadn't resolved in over 2 days.
>
> So at the moment, it's either rebalancing the cleaned-up OSDs or in a loop thinking that it is.

Since you deleted some pools, it's probably the upmap balancer rebalancing PGs across the OSDs.

> And the PG-per-OSD count seems way too high,

What is it right now? With what hardware?

> but the autoscaler doesn't seem to want to do anything about that.

If the PG autoscaler is enabled, you could try adjusting per-pool settings [1] and see if the number of PGs decreases. If it's disabled, you could manually reduce the number of PGs on the remaining pools to lower the PG/OSD ratio.

Regards,
Frédéric.

>
> Of course, the whole shebang has been unavailable to clients this whole week because of that.
>
> I've been considering upgrading to Reef, but recent posts regarding issues resembling what I've been going through are making me pause.
>
> Again, thanks!
>    Tim
>
> On Wed, 2025-02-26 at 13:57 +0100, Frédéric Nass wrote:
>> Hi Tim,
>>
>> If you can't bring the host back online so that cephadm can remove these services itself, I guess you'll have to clean up the mess by:
>>
>> - removing these services from the cluster (for example with a 'ceph mon remove {mon-id}' for the monitor)
>> - forcing their removal from the orchestrator with the --force option on the commands 'ceph orch daemon rm <names>' and 'ceph orch host rm <hostname>'.
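For reference, that cleanup could look roughly like the commands below. The monitor id, daemon names, and hostname are only placeholders based on this thread (assuming the dead host is ceph07.internal.mousetech.com); use whatever 'ceph orch ps' actually reports for that host:

    # Remove the dead monitor from the monmap (assuming its mon id is 'ceph07')
    ceph mon remove ceph07

    # Force-remove the leftover daemons the orchestrator still lists for the host
    # (daemon names here are made up; take the real ones from 'ceph orch ps')
    ceph orch daemon rm mon.ceph07 --force
    ceph orch daemon rm mds.myfs.ceph07.abcdef --force

    # Finally, drop the host itself from the orchestrator inventory
    ceph orch host rm ceph07.internal.mousetech.com --force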
>> If the --force option doesn't help, then looking into, editing, or removing ceph config keys like 'mgr/cephadm/inventory' and 'mgr/cephadm/host.ceph07.internal.mousetech.com' that show up in the 'ceph config-key dump' output might help.
>>
>> Regards,
>> Frédéric.
>>
>> ----- On Feb 25, 2025, at 16:42, Tim Holloway timh@xxxxxxxxxxxxx wrote:
>>
>> > Ack. Another fine mess.
>> >
>> > I was trying to clean things up, and the process of tossing around OSDs kept getting me reports of slow responses and hanging PG operations.
>> >
>> > This is Ceph Pacific, by the way.
>> >
>> > I found a deprecated server that claimed to have an OSD even though it didn't show in either "ceph osd tree" or the dashboard OSD list. I suspect that a lot of the grief came from it attempting to use resources that weren't always seen as resources.
>> >
>> > I shut down the server's OSD (removed the daemon using ceph orch), then foolishly deleted the server from the inventory without doing a drain first.
>> >
>> > Now cephadm hates me (key not found), and there are still an MDS and a MON listed as 'ceph orch' daemons even after I powered the host off.
>> >
>> > I cannot do a ceph orch daemon delete because there's no longer an IP address available for the daemon delete, and I cannot clear the cephadm queue:
>> >
>> > [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'ceph07.internal.mousetech.com'
>> >
>> > Any suggestions?
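Regarding the PG-per-OSD ratio mentioned above, here is a minimal sketch of how one might check it and, if the autoscaler isn't doing it for you, reduce it by hand. The pool name 'mypool' and the pg_num value of 64 are only example values, not a recommendation for this cluster:

    # Check the current PG count per OSD (PGS column) and the autoscaler's view per pool
    ceph osd df tree
    ceph osd pool autoscale-status

    # Manually shrink an oversized pool; pick a power of two that fits
    # the pool's data and the number of remaining OSDs
    ceph osd pool set mypool pg_num 64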