On 05/11/2024 at 13:59:34 -0500, Tim Holloway wrote:

Hi,

Thanks for your long answer.

> That can be a bit sticky.
>
> First, check to see if you have a /var/log/messages file. The dmesg
> log isn't always as complete.

I forgot to say: there are no relevant messages, at least none that I
can interpret, just messages very close to what systemctl gives me.
The container starts and then dies.

> Also, of course, make sure you have enough spare RAM and disk space
> to run the OSD. When running a managed OSD, a LOT of space is used
> under the root directory several layers down, where the container
> image and support data are kept. A 'df -h /' should give you a clue
> if there's a problem there.

No problem there: the server was running without issue before the
reboot.

> Failing that, the next consideration is which type of OSD this is:
> is it a legacy OSD or a managed (containerized) OSD? Or, Ganesha
> forbid, both?

Managed, with podman and cephadm.

> Legacy OSDs are defined in /var/lib/ceph. Containerized OSDs are
> defined in /var/lib/ceph/{fsid}. If the same OSD number is defined
> in BOTH places, congratulations! Welcome to the schizophrenic OSD
> club!
>
> Cleaning up a dual-defined OSD is rather daunting, and I recommend
> scanning this message list for stuff with my name on it from back
> around this June, where I slogged my own way through that problem.
>
> I'm FAIRLY sure that if you stop Ceph, delete the /var/lib/ceph OSD
> and restart, things will clear up. Back up your data first, though!
> Just in case. The actual OSD data is not contained within that
> directory, but rather soft-linked. But why take chances?
>
> If you did that, reboot and do a "systemctl --failed". If you see a
> failed OSD unit and it doesn't have your fsid in its name, then you
> will probably find a legacy systemd unit in /etc/ceph that you
> should delete. That should clear you up.
>
> Now, on the other hand, let's suppose you have no legacy OSD
> metadata, just a managed OSD. In that case, I'd reweight the OSD and
> let the system reach equilibrium. Check via "ceph osd tree" to make
> sure things are set properly. The "ceph orch ps" command is also a
> friend to make sure you know what's running and what isn't.
>
> Once (if!) the OSD has drained, the cleanest approach would be to
> destroy it and re-create it.
>
> So much for simple solutions. Managed OSDs are under systemd
> control, but there is no permanent systemd unit file for them.
> Instead, a template common to all OSDs is injected with the OSD
> number and the resulting unit goes into a volatile directory
> (/var/run/systemd, I think). In the normal course of events, the
> generic Ceph container image is launched via docker or podman. The
> launch options by default include the '--rm' option, which removes
> the container when it is stopped. Which also destroys the
> container's log!
>
> In a properly functioning system, this doesn't matter, since Ceph
> redirects the container log to the systemd journal. Where it gets
> problematic is if the container fails before that point is reached.
>
> I've diagnosed Ceph problems by doing a "brute force" launch of a
> Ceph container without the '--rm' option, but it's not for the faint
> of heart. ;-)

;-) ;-)

What I don't understand is why, after another reboot, everything
works again (no issues so far). For the record, if it happens again I
will try the checks sketched below.
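First, checking whether the OSD is somehow defined both as a legacy
daemon and as a managed one, along the lines you describe. This is
just a rough sketch: I am assuming the usual cephadm layout and that
the fsid is the one reported by "ceph fsid".

  fsid=$(ceph fsid)
  ls -ld /var/lib/ceph/osd/ceph-*     # legacy OSD data dirs, if any
  ls -ld /var/lib/ceph/$fsid/osd.*    # cephadm-managed OSD dirs
  systemctl --failed | grep -i ceph   # a failed unit WITHOUT the fsid
                                      # in its name would be a legacy leftover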
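Then, if the OSD itself turns out to be unhealthy, draining and
re-creating it through the orchestrator, roughly as you suggest. As I
understand it, "ceph orch osd rm" handles the draining by itself;
osd 12 is just a made-up id for the example.

  ceph osd tree              # confirm the OSD's weight and up/in state
  ceph orch ps | grep osd    # what cephadm thinks is running
  ceph orch osd rm 12        # drain the OSD, then remove it
  ceph orch osd rm status    # watch the drain progress
  # afterwards the disk can be re-added, e.g. with
  # "ceph orch daemon add osd <host>:<device>"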
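And to catch the container's own output the next time it dies before
reaching the journal, something like the following. This is untested
and assumes the unit.run file that cephadm generates under the daemon
directory (same fsid and made-up OSD id as above).

  cat /var/lib/ceph/$fsid/osd.12/unit.run  # shows the full podman command line
  # run that podman command by hand, without --rm (and without -d),
  # so the container and its log survive the crash, then:
  podman ps -a                             # find the dead container
  podman logs <container id>               # read what it printed before dying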
Regards.

JAS

> On Tue, 2024-11-05 at 15:25 +0100, Albert Shih wrote:
> > Hi everyone,
> >
> > I'm currently running Ceph Reef 18.2.4 with podman on Debian 12.
> >
> > Today I had to reboot the cluster (firmware alerts). Everything
> > went as planned, but one OSD on one server would not start. I
> > tried everything in
> >
> > https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-osd/
> >
> > but no luck.
> >
> > I don't find any message in dmesg.
> >
> > Zero messages with journalctl.
> >
> > Zero messages with systemctl status.
> >
> > In the end I rebooted the server once more and everything worked
> > fine again.
> >
> > Has anyone encountered something like that? Is that “normal”?
> >
> > Regards
> --
--
Albert SHIH 🦫 🐸
France
Local time: Wed 06 Nov 2024 11:35:17 CET
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx