Re: Schrödinger's Server

Thanks. I did resolve that problem, though I haven't had a chance to
post an update until now.

I had already attempted to use ceph orch to remove the daemons, but
the removals didn't succeed.

Fortunately, I was able to bring the host back online, which allowed
the scheduled removals to complete. I confirmed everything was
drained, removed the host from inventory again, and powered it down.
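
For anyone hitting the same thing, the sequence that eventually worked
for me was roughly this (hostname from my cluster; adjust to taste):

  # drain all daemons off the host and watch the queue empty out
  ceph orch host drain ceph07.internal.mousetech.com
  ceph orch osd rm status
  # only then remove the host from inventory
  ceph orch host rm ceph07.internal.mousetech.com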

Still got complaints from cephadm about the decommissioned host.

I took a break - impatience and Ceph don't mix - and came back to
address the next problem, which was lots of stuck PGs. Either because
cephadm timed out or because something kicked in when I started
rebooting OSDs at random, the complaint about the host finally
disappeared. End of story.
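
In case it helps anyone, the commands I leaned on to see what was
stuck were along these lines:

  # overall picture of what's unhappy
  ceph health detail
  # list PGs stuck in particular states
  ceph pg dump_stuck unclean
  ceph pg dump_stuck inactive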

Now for what sent me down that path.

I had 2 OSDs on one server and felt that was probably not a good
idea, so I marked one for deletion. 4 days later it was still in
"destroying" state. More concerning, all signs indicated that despite
having been reweighted to 0, the "destroying" OSD was still an
essential participant, and there was no indication that its PGs were
being relocated to active servers. Shutting down the "destroying" OSD
would immediately trigger a re-allocation panic, but that didn't clean
anything up. The re-allocation would proceed at a furious pace, then
slowly stall out and hang, leaving the system degraded. Restarting the
OSD brought the PG inventory back up, but stuff still wasn't moving
off the OSD.
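
A couple of sanity checks I'd suggest to anyone with an OSD stuck in
"destroying" (the OSD id here is just an example):

  # show the orchestrator's removal queue
  ceph orch osd rm status
  # list any PGs still mapped to the OSD
  ceph pg ls-by-osd 4
  # ask the cluster whether destroying it would lose data
  ceph osd safe-to-destroy 4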

Right about that time I decommissioned the questionable host.

Finally, I did a "ceph orch rm osd.x", which terminated the
"destroying" OSD permanently and made it finally disappear from the
OSD tree.
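
For reference, if the orchestrator won't let go of a ghost OSD, the
old-school purge should still work (again, the id is illustrative):

  # removes the CRUSH entry, auth key and osdmap entry in one shot
  ceph osd purge 4 --yes-i-really-mean-it
  # or the equivalent, one step at a time:
  ceph osd crush remove osd.4
  ceph auth del osd.4
  ceph osd rm 4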

I also deleted a number of pools that will (hopefully) not be
missed.
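
Pool deletion is gated behind a monitor setting, so for anyone trying
this at home it goes something like this (pool name is made up):

  ceph config set mon mon_allow_pool_delete true
  # note the pool name has to be given twice
  ceph osd pool rm junkpool junkpool --yes-i-really-really-mean-it
  ceph config set mon mon_allow_pool_delete false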

Kicking and repeatedly rebooting the other OSDs at random finally
cleared all the stuck PGs, some of which hadn't resolved in over 2
days.
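
The "kicking" was nothing fancier than restarting daemons through the
orchestrator, one at a time, something like (id illustrative):

  ceph orch daemon restart osd.4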

So at the moment, it's either rebalancing the cleaned-up OSDs or
stuck in a loop thinking that it is. And the PG-per-OSD count seems
way too high, but the autoscaler doesn't seem to want to do anything
about it.
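
I'm watching it with the autoscaler status view; if it stays passive
I may have to nudge it per pool (pool name again made up):

  # shows target vs. actual PG counts per pool
  ceph osd pool autoscale-status
  # make sure autoscaling is actually enabled on the busy pools
  ceph osd pool set junkpool pg_autoscale_mode on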

Of course, the whole shebang has been unavailable to clients all week
because of that.

I've been considering upgrading to Reef, but recent posts describing
issues resembling what I've been going through are giving me pause.

  Again, thanks!
    Tim

On Wed, 2025-02-26 at 13:57 +0100, Frédéric Nass wrote:
> Hi Tim,
> 
> If you can't bring the host back online so that cephadm can remove
> these services itself, I guess you'll have to clean up the mess by:
> 
> - removing these services from the cluster (for example with a 'ceph
> mon remove {mon-id}' for the monitor)
> - forcing their removal from the orchestrator with the --force option
> on the commands 'ceph orch daemon rm <names>' and 'ceph orch host rm
> <hostname>'. If the --force option doesn't help, then looking
> into/editing/removing ceph-config keys like 'mgr/cephadm/inventory'
> and 'mgr/cephadm/host.ceph07.internal.mousetech.com' that 'ceph
> config-key dump' output shows might help.
> 
> Regards,
> Frédéric.
> 
> ----- On Feb 25, 2025, at 16:42, Tim Holloway timh@xxxxxxxxxxxxx
> wrote:
> 
> > Ack. Another fine mess.
> > 
> > I was trying to clean things up and the process of tossing around
> > OSDs
> > kept getting me reports of slow responses and hanging PG
> > operations.
> > 
> > This is Ceph Pacific, by the way.
> > 
> > I found a deprecated server that claimed to have an OSD even though
> > it
> > didn't show in either "ceph osd tree" or the dashboard OSD list. I
> > suspect that a lot of the grief came from it attempting to use
> > resources that weren't always seen as resources.
> > 
> > I shut down the server's OSD (removed the daemon using ceph orch),
> > then
> > foolishly deleted the server from the inventory without doing a
> > drain
> > first.
> > 
> > Now cephadm hates me (key not found), and there are still an MDS
> > and
> > MON listed as ceph orch ls daemons even after I powered the host
> > off.
> > 
> > I cannot do a ceph orch daemon delete because there's no longer an
> > IP
> > address available to the daemon delete, and I cannot clear the
> > cephadm queue:
> > 
> > [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed:
> > 'ceph07.internal.mousetech.com'
> > 
> > Any suggestions?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



