Re: OSD refuse to start

On 05/11/2024 at 13:59:34 -0500, Tim Holloway wrote:

Hi, 


Thanks for your long answer. 

> That can be a bit sticky.
> 
> First, check to see if you have a /var/log/messages file. The dmesg log
> isn't always as complete.

Forgot to say: no relevant messages, at least to my understanding, just
messages very close to what systemctl gives me. The container started and then died. 
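
For reference, the kind of check meant above would look roughly like this
(a sketch; the fsid and OSD id below are placeholders, assuming a
cephadm/podman setup):

  # fsid and OSD id are placeholders -- adjust to the failing daemon
  FSID=$(ceph fsid)
  OSD=12

  # systemd journal for the managed OSD unit
  journalctl -u "ceph-${FSID}@osd.${OSD}.service" --no-pager | tail -n 100

  # and the classic syslog file, when the host has one
  grep "osd.${OSD}" /var/log/messages | tail -n 50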

> Also, of course, make sure you have enough spare RAM and disk space to
> run the OSD. When running a Managed OSD, a LOT of space is used under
> the root directory several layers down, where the container image and
> support data are kept. A 'df -h /' should give you a clue if there's a
> problem there. 

No problem there; the server was running without issue before the
reboot. 
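
For completeness, the space check meant above, with podman's default
storage location (an assumption on my side, not something checked in this
thread):

  # root filesystem, where cephadm and podman keep their data by default
  df -h /

  # container images/layers (podman default) plus the cephadm daemon data
  du -sh /var/lib/containers /var/lib/ceph 2>/dev/null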

> 
> Failing that, the next consideration is which type of OSD this is. Is
> it a legacy OSD or a managed (containerized) OSD? Or, Ganesha forbid,
> both?

Managed, with podman and cephadm.

> 
> Legacy OSDs are defined in /var/lib/ceph. Containerized OSDs are
> defined in /var/lib/ceph/{fsid}. If the same OSD number is defined in
> BOTH places, congratulations! Welcome to the schizophrenic OSD club!
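
A quick way to check for that double definition (a sketch; as far as I
understand, legacy OSDs live in /var/lib/ceph/osd/ceph-<id> and managed
ones in /var/lib/ceph/<fsid>/osd.<id>):

  # legacy layout: one directory per OSD under /var/lib/ceph/osd
  ls -d /var/lib/ceph/osd/ceph-* 2>/dev/null

  # managed (cephadm) layout: per-fsid directory with osd.<id> subdirs
  ls -d /var/lib/ceph/*/osd.* 2>/dev/null

  # the same OSD number showing up in both lists is the dual-defined case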
> 
> Cleaning up a dual-defined OSD is rather daunting, and I recommend
> scanning this message list for stuff with my name on it from back
> around this June, where I slogged my own way through that problem.
> 
> I'm FAIRLY sure that if you stop Ceph, delete the /var/lib/ceph OSD and
> restart that things will clear up. Back up your data first, though!
> Just in case. The actual OSD data is not contained within that
> directory, but rather soft-linked. But why take chances?
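
The "soft-linked" part is easy to verify before deleting anything; in a
legacy bluestore OSD directory the block device is only a symlink (sketch
with a placeholder id):

  # "block" points at the real LV/partition; the directory itself holds
  # no object data
  ls -l /var/lib/ceph/osd/ceph-12/block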
> 
> If you did that, reboot and do a "systemctl --failed". If you see a
> failed OSD unit and it doesn't have your fsid in its name, then you
> probably will find a legacy systemd unit in /etc/ceph that you should
> delete. That should clear you up.
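
In practice that check is roughly the following (a sketch; the unit
naming reflects my understanding of the two flavours):

  # anything that failed at boot
  systemctl --failed

  # legacy units look like ceph-osd@<id>.service, managed ones like
  # ceph-<fsid>@osd.<id>.service
  systemctl list-units 'ceph*' --all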
> 
> Now, on the other hand, let's suppose you have no legacy OSD metadata,
> just a Managed OSD. In that case, I'd reweight the OSD and let the
> system reach equilibrium. Check via "ceph osd tree" to make sure things
> are set properly. The "ceph orch ps" command is also a friend to make
> sure you know what's running and what isn't.
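
If it comes to that, a minimal sequence could be the following (a sketch
with a placeholder OSD id; to be double-checked before moving any data):

  # drain the OSD by weighting it to zero, then watch the rebalance
  ceph osd crush reweight osd.12 0
  ceph -s

  # sanity checks on placement and on what cephadm thinks is running
  ceph osd tree
  ceph orch ps | grep osd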
> 
> Once (if!) the OSD has drained, the cleanest approach would be to
> destroy and re-create it anew.
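
With cephadm that step is usually something along these lines (again a
sketch, not tested against this cluster):

  # remove the drained OSD and wipe its device so it can be redeployed
  ceph orch osd rm 12 --zap
  ceph orch osd rm status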
> 
> So much for simple solutions. Managed OSDs are under systemd control,
> but there is no permanent systemd unit file for them. Instead, a
> template common to all OSDs is injected with the OSD number and the
> resulting unit goes into a volatile directory (/var/run/systemd, I
> think). In the normal course of events, the generic Ceph container
> image is launched via docker or podman. The launch options by default
> include the '-d' option, which destroys the container when it is
> stopped. Which also destroys the container's log!
> 
> In a properly functioning system, this doesn't matter, since Ceph
> redirects the container log to the systemd journal. Where it gets
> problematic is if the container fails before that point is reached.
> 
> I've diagnosed Ceph problems by doing a "brute force" launch of a Ceph
> container without the "-d" option, but it's not for the faint of heart.

;-) ;-)
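
For the curious, the pieces involved can at least be inspected without
going the full brute-force route (a sketch; paths assume a standard
cephadm layout):

  FSID=$(ceph fsid)

  # the systemd unit generated from the per-cluster template
  systemctl cat "ceph-${FSID}@osd.12.service"

  # the actual podman command cephadm runs for this daemon
  less /var/lib/ceph/${FSID}/osd.12/unit.run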

What I don't understand is why, after another reboot, everything works
again (so far without issue). 

Regards.

JAS

> On Tue, 2024-11-05 at 15:25 +0100, Albert Shih wrote:
> > Hi everyone,
> > 
> > 
> > I'm currently running Ceph Reef 18.2.4 with podman on Debian 12.
> > 
> > Today I had to reboot the cluster (firmware alerts). Everything went
> > as planned, but... one OSD on one server wouldn't start. I tried
> > everything in 
> > 
> >  
> > https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/
> > 
> > but no luck. 
> > 
> > I didn't find any message in dmesg.
> > 
> > Zero messages with journalctl.
> > 
> > Zero messages with systemctl status.
> > 
> > In the end I rebooted the server once more and everything worked
> > fine again.
> > 
> > Has anyone encountered something like that? Is that “normal”?
> > 
> > Regards
> > -- 
-- 
Albert SHIH 🦫 🐸
France
Heure locale/Local time:
mer. 06 nov. 2024 11:35:17 CET
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



