I'd put in an RFE to detect/prevent creation of mutually-exclusive OSD definitions on a single OSD storage unit myself, since that's the real problem. As Eugen has noted, you can up-convert a traditional OSD to cephadm management... unless there's already a managed instance existing. I can attest to that from experience.

The reasons I recommend complete destruction/rebuilding of the offending OSDs boil down to this:

1. Not knowing the internals of either the new or the old OSD logic, I feel it risky to just rip things out by brute force.

2. Because two independent and mutually-ignorant processes have been working on the same OSD backing store, there is a possibility of corruption. I don't expect an OSD to have suitable interlocks to prevent that, since normally a single OSD is the sole owner of its backing store and interlocks would just slow it down for no purpose.

Thus, by cleanly shutting down the old-style OSD process, leaving just the container-based OSD running, draining that OSD, wiping out everything that automated cleanup missed, and re-creating the OSD, all of the data in the OSD will have passed through the migration process twice, and I would expect migration to detect and clean up (or at least report) any inconsistencies, so that they don't pop up months or years later. The rough sequence I have in mind is sketched below.
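This is only a sketch, not something I've run verbatim against your cluster: the OSD ID (11), host (ceph02) and device (/dev/sdX) are placeholders based on Dan's output, and the exact "ceph orch osd rm" flags vary between releases, so check "ceph orch osd rm --help" on your version first.

  # Stop and disable the legacy (non-cephadm) unit so only the containerized
  # OSD owns the backing store, and clear the leftover mount that causes
  # the "device busy" error.
  systemctl disable --now ceph-osd@11
  umount /var/lib/ceph/osd/ceph-11

  # Drain the cephadm-managed OSD, remove it, and zap the backing device.
  ceph orch osd rm 11 --zap
  ceph orch osd rm status        # repeat until the drain/removal completes

  # If anything survived the automated zap, clean the device by hand.
  ceph-volume lvm zap --destroy /dev/sdX

  # Re-create the OSD through the orchestrator (or let an existing
  # OSD service spec pick the device up automatically).
  ceph orch daemon add osd ceph02:/dev/sdX

Note that "ceph orch osd rm" drains the OSD before purging it, so the removal won't finish until the data has migrated off; even so, I'd wait for all PGs to be active+clean before re-creating the OSD.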
Granted, if you have triple redundancy on the pools in the OSD, the likelihood of lost/damaged data is pretty low, but in my case I was still recovering from a fried Internet connection and didn't want any more surprises.

So ultimately the choice is yours: quick fix or slow.

Tim

On Sat, 2024-08-17 at 08:05 +0000, Eugen Block wrote:
> Hi,
>
> > When things settle down, I *MIGHT* put in an RFE to change the
> > default for ceph-volume to --no-systemd to save someone else from
> > this anguish.
>
> note that there are still users/operators/admins who don't use
> containers. Changing the ceph-volume default might not be the best
> idea in this case.
>
> Regarding the cleanup, this was the thread [1] Tim was referring to. I
> would set the noout flag, stop an OSD (so the device won't be busy
> anymore), make sure that both ceph-osd@{OSD_ID} and
> ceph-{FSID}@osd.{OSD_ID} are stopped, then double check that everything
> you need is still under /var/lib/ceph/{FSID}/osd.{OSD_ID}, like configs
> and keyrings. Disable the ceph-osd@{OSD_ID} unit (as already pointed
> out), then check if the orchestrator can start the OSD via systemd:
>
> ceph orch daemon start osd.{OSD_ID}
>
> or alternatively, try it manually:
>
> systemctl reset-failed
> systemctl start ceph-{FSID}@osd.{OSD_ID}
>
> Watch the log for that OSD to identify any issues. If it works, unset
> the noout flag. You might want to ensure it also works after a reboot,
> though.
> I don't think it should be necessary to redeploy the OSDs, but the
> cleanup has to be proper.
> As a guidance you can check the cephadm tool's contents and look for
> the "adopt" function. That migrates the contents of the pre-cephadm
> daemons into the FSID-specific directories.
>
> Regards,
> Eugen
>
> [1] https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/K2R3MXRD3S2DSXCEGX5IPLCF5L3UUOQI/
>
> Zitat von Dan O'Brien <dobrie2@xxxxxxx>:
>
> > OK... I've been in the Circle of Hell where systemd lives and I
> > *THINK* I have convinced myself I'm OK. I *REALLY* don't want to
> > trash and rebuild the OSDs.
> >
> > In the manpage for systemd.unit, I found
> >
> > UNIT GARBAGE COLLECTION
> >     The system and service manager loads a unit's configuration
> >     automatically when a unit is referenced for the first time. It will
> >     automatically unload the unit configuration and state again when the
> >     unit is not needed anymore ("garbage collection").
> >
> > I've disabled the systemd units (which removes the symlink from the
> > target) for the non-cephadm OSDs I created by mistake and I'm PRETTY
> > SURE if I wait long enough (or reboot) that I won't see them any
> > more, since there won't be a unit for systemd to care about.
> >
> > I *WILL* have to clean up /var/lib/ceph/osd eventually. I tried just
> > now, but it says "device busy." I think that's because there's some
> > OTHER systemd cruft that shows a mount:
> >
> > [root@ceph02 ~]# systemctl --all | grep ceph | grep mount
> > var-lib-ceph-osd-ceph\x2d11.mount   loaded   active   mounted   /var/lib/ceph/osd/ceph-11
> > var-lib-ceph-osd-ceph\x2d25.mount   loaded   active   mounted   /var/lib/ceph/osd/ceph-25
> > var-lib-ceph-osd-ceph\x2d9.mount    loaded   active   mounted   /var/lib/ceph/osd/ceph-9
> >
> > When things settle down, I *MIGHT* put in an RFE to change the
> > default for ceph-volume to --no-systemd to save someone else from
> > this anguish.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx