That can be a bit sticky. First, check to see if you have a /var/log/messages file. The dmesg log isn't always as complete. Also, of course, make sure you have enough spare RAM and disk space to run the OSD. When running a Managed OSD, a LOT of space is used under the root directory several layers down, where the container image and support data are kept. A 'df -h /' should give you a clue if there's a problem there.

Failing that, the next consideration is which type of OSD this is. Is it a legacy OSD or a managed (containerized) OSD? Or, Ganesha forbid, both? Legacy OSDs are defined in /var/lib/ceph. Containerized OSDs are defined in /var/lib/ceph/{fsid}. If the same OSD number is defined in BOTH places, congratulations! Welcome to the schizophrenic OSD club!

Cleaning up a dual-defined OSD is rather daunting, and I recommend scanning this message list for posts with my name on them from back around this June, where I slogged my own way through that problem. I'm FAIRLY sure that if you stop Ceph, delete the /var/lib/ceph OSD, and restart, things will clear up. Back up your data first, though! Just in case. The actual OSD data is not contained within that directory, but rather soft-linked. But why take chances?

If you did that, reboot and do a "systemctl --failed". If you see a failed OSD unit and it doesn't have your fsid in its name, then you will probably find a legacy systemd unit in /etc/ceph that you should delete. That should clear you up.

Now, on the other hand, let's suppose you have no legacy OSD metadata, just a Managed OSD. In that case, I'd reweight the OSD and let the system reach equilibrium. Check via "ceph osd tree" to make sure things are set properly. The "ceph orch ps" command is also a friend to make sure you know what's running and what isn't. Once (if!) the OSD has drained, the cleanest approach would be to destroy and re-create it anew.

So much for simple solutions. Managed OSDs are under systemd control, but there is no permanent systemd unit file for them. Instead, a template common to all OSDs is injected with the OSD number, and the resulting unit goes into a volatile directory (/var/run/systemd, I think). In the normal course of events, the generic Ceph container image is launched via docker or podman. The launch options by default include the '-d' option, which destroys the container when it is stopped. Which also destroys the container's log! In a properly functioning system this doesn't matter, since Ceph redirects the container log to the systemd journal. Where it gets problematic is if the container fails before that point is reached. I've diagnosed Ceph problems by doing a "brute force" launch of a Ceph container without the '-d' option, but it's not for the faint of heart.
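For what it's worth, a rough sketch of the checks described above looks something like this. Treat it as a sketch only: {fsid} is a placeholder for your cluster's fsid, and the exact directory contents will depend on how the host was deployed.

    df -h /                    # enough root filesystem space for the container image and support data?
    free -m                    # enough spare RAM to run the OSD?
    ls /var/lib/ceph/          # legacy OSDs are defined directly under here
    ls /var/lib/ceph/{fsid}/   # managed (containerized) OSDs are defined under the fsid directory
    systemctl --failed         # any failed OSD units, and do their names carry the fsid?
    journalctl -b | grep -i osd  # managed OSDs normally log to the systemd journal
    ceph osd tree              # weights and up/down status
    ceph orch ps               # what the orchestrator thinks is running

If the same OSD number shows up both directly under /var/lib/ceph and under the {fsid} directory, you're in the dual-definition situation described above.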
Tim

On Tue, 2024-11-05 at 15:25 +0100, Albert Shih wrote:
> Hi everyone,
>
> I'm currently running Ceph Reef 18.2.4 with podman on Debian 12.
>
> Today I had to reboot the cluster (firmware alerts). Everything went as
> planned but... one OSD on one server won't start. I tried everything in
>
> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/
>
> but no luck.
>
> I don't find any message in dmesg.
>
> Zero messages with journalctl.
>
> Zero messages with systemctl status.
>
> In the end I rebooted the server once more and everything worked fine
> again.
>
> Has anyone encountered something like that? Is that “normal”?
>
> Regards
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> mar. 05 nov. 2024 15:11:40 CET