Re: Phantom hosts

Hi Eugen,

It's gone now, although similar artefacts seem to linger.

The reason it's gone is that I've been upgrading all my machines to
AlmaLinux 8 from CentOS 7 and AlmaLinux 7, as one is already EOL and
the other is within days of it. Rather than upgrade-in-place, I chose
to nuke/replace the entire system disks and provision from scratch. It
helped me clean up my network and get rid of years of cruft.

Ceph helped a lot there, since I'd do one machine at a time, and since
the provisioning data is on Ceph, it was always available even as
individual machines went up and down.

I lost the phantom host, although for a while one of the newer OSDs
gave me issues. The container would die on startup, claiming (in a
badly-quoted message) that the OSD block device was "already in use". I
believe this
happened right after I moved the _admin node to that machine.

I finally got the failed machine back online by manually stopping the
systemd service, waiting a while, then starting (not restarting) it.
But some other nodes may have been rebooted in the interim, so it's
hard to be certain what actually made it happy. Annoyingly, the
dashboard and OSD tree listed the failed node as "up" and "in" even
though "ceph orch ps" showed it as "error". I couldn't persuade it to
go down and out, or I would have destroyed and re-created it.
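
In case it's useful to anyone hitting the same "already in use"
message, what I did was roughly this (the fsid and OSD id below are
placeholders, and the exact unit name may differ on your setup):

  # find the cephadm-managed unit for the OSD on that host
  systemctl list-units 'ceph-*@osd.*'

  # stop it, and give the old container time to release the block device
  systemctl stop ceph-<fsid>@osd.<id>.service
  sleep 60

  # start (not restart) it
  systemctl start ceph-<fsid>@osd.<id>.service

  # confirm the orchestrator agrees it's running again
  ceph orch ps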

I did clear up a major mess, though. My original install/admin machine
was littered with dead and mangled objects. Two long-deleted OSDs left
traces, and there was a mix of pre-cephadm components (including OSDs)
and newer stuff.
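
For anyone doing similar archaeology, something like this (run on the
affected host; the daemon name and fsid are placeholders) should show
which pieces are legacy and which are cephadm-managed:

  # list every daemon cephadm can see on this host; the "style" field
  # distinguishes legacy daemons from cephadm-managed ones
  cephadm ls

  # remove a stray daemon that no longer exists in the cluster
  cephadm rm-daemon --name osd.<id> --fsid <fsid> --force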

I did discover a node still running Octopus which I plan to upgrade
today, but overall things look pretty clean, excepting the ever-
frustrating "too many PGs per OSD" warning. If the PG autoscaler was
supposed to fix this automatically, it isn't doing so, even though
autoscaling is switched on. Manual changes don't seem to take effect
either.
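
For reference, these are the knobs I understand to be involved here,
in case someone can tell me what I'm missing (the pool name and pg_num
value below are placeholders):

  # see what the autoscaler thinks each pool should have
  ceph osd pool autoscale-status

  # make sure autoscaling is actually on for a given pool
  ceph osd pool set <pool> pg_autoscale_mode on

  # or force pg_num down by hand (PG merging works on Nautilus and later)
  ceph osd pool set <pool> pg_num 32

  # the warning threshold itself
  ceph config get mon mon_max_pg_per_osd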

Going back to the phantom host situation, one thing I have seen is that
the dashboard's Hosts view lists OSDs that have been deleted as still
belonging to that machine. "ceph osd tree" and the dashboard's OSD view
disagree, showing neither the phantom host nor the deleted OSDs.
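
For anyone comparing notes, the checks and cleanup that should apply
here look roughly like this (the OSD id is a placeholder, and purge is
destructive, so only use it on OSDs that really are gone):

  # what the cluster map actually contains
  ceph osd tree

  # look for auth keys and CRUSH entries left behind by deleted OSDs
  ceph auth ls | grep osd
  ceph osd crush tree

  # fully remove a long-gone OSD (CRUSH entry, auth key, osd id)
  ceph osd purge <id> --yes-i-really-mean-it

  # sometimes the dashboard just needs the active mgr bounced to catch up
  ceph mgr fail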

Just to recap, the original phantom host was a non-ceph node that
accidentally got sucked in when I did a host add with the wrong IP
address. It then claimed to own another host's OSD.
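
For completeness, the way I'd expect to back that kind of mistake out
is below (the hostname is a placeholder, and the --offline/--force
flags may not exist on older releases):

  # confirm what the orchestrator thinks it manages
  ceph orch host ls

  # remove the accidentally-added host; --offline/--force is for a host
  # that was never really a functioning part of the cluster
  ceph orch host rm <hostname> --offline --force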

  Thanks,
    Tim

On Tue, 2024-07-09 at 06:08 +0000, Eugen Block wrote:
> Hi Tim,
> 
> is this still an issue? If it is, I recommend adding some more details
> so it's easier to follow your train of thought.
> 
> ceph osd tree
> ceph -s
> ceph health detail
> ceph orch host ls
> 
> And then please point out which host you're trying to get rid of. I
> would deal with the rgw thing later. Is it possible that the phantom
> host actually had OSDs on it? Maybe that needs to be cleaned up first.
> I had something similar on a customer cluster recently where we hunted
> failing OSDs, but it turned out they were removed quite a while ago,
> just not properly cleaned up yet on the filesystem.
> 
> Thanks,
> Eugen
> 
> Zitat von Tim Holloway <timholloway34@xxxxxxxxx>:
> 
> > It's getting worse.
> > 
> > As many may be aware, the venerable CentOS 7 OS is hitting
> > end-of-life in a matter of days.
> > 
> > The easiest way to upgrade my servers has been to simply create an
> > alternate disk with the new OS, turn my provisioning system loose on
> > it, yank the old OS system disk, and jack in the new one.
> > 
> > 
> > However, Ceph is another matter. For that part, the simplest thing
> > to do is to destroy the Ceph node(s) on the affected box, do the OS
> > upgrade, then re-create the nodes.
> > 
> > But now I have even MORE strays. The OSD on my box lives on in Ceph
> > in the dashboard host view even though the documented removal
> > procedures were followed and the VM itself was destroyed.
> > 
> > Further, this last node is an RGW node and I cannot remove it from
> > the RGW configuration. It not only shows on the dashboard, it also
> > lists as still active on the command line and as entries in the
> > config database no matter what I do.
> > 
> > 
> > I really need some solution to this, as it's a major chokepoint in
> > the upgrade process.
> > 
> > 
> >    Tim
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



