Thank you so much for the help! Thanks to the issue you linked and the other person you replied to who had the same problem, I was able to edit the config-key and get my orchestrator back. Sorry for not checking the tracker issues as well as I should have; that's my bad.
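
For anyone who finds this thread later: the fix was basically to pull the stored removal queue out of the config-key store, strip out what the module chokes on (here, the "original_weight" fields), and write it back before restarting the active manager. Something along these lines (a rough sketch rather than my exact commands; the file name is just an example, and keep a backup of the original value first):

    # dump the stored OSD removal queue and keep a copy
    ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.json
    cp osd_remove_queue.json osd_remove_queue.json.bak

    # edit osd_remove_queue.json and delete every "original_weight": ... field
    # (or, if no removals should be pending anymore, reduce it to an empty list)

    # write the cleaned JSON back and fail over the active mgr so cephadm reloads it
    ceph config-key set mgr/cephadm/osd_remove_queue -i osd_remove_queue.json
    ceph mgr fail

After the mgr failover, the orchestrator came back.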

On Mon, Aug 19, 2024 at 6:12 AM Eugen Block <eblock@xxxxxx> wrote:

> There's a tracker issue for this:
>
> https://tracker.ceph.com/issues/67329
>
> Zitat von Eugen Block <eblock@xxxxxx>:
>
> > Hi,
> >
> > what is the output of this command?
> >
> > ceph config-key get mgr/cephadm/osd_remove_queue
> >
> > I just tried to cancel a draining on a small 18.2.4 test cluster; it went well, though. After scheduling the drain, the mentioned key looks like this:
> >
> > # ceph config-key get mgr/cephadm/osd_remove_queue
> > [{"osd_id": 1, "started": true, "draining": false, "stopped": false,
> > "replace": false, "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125, "drain_started_at": null,
> > "drain_stopped_at": null, "drain_done_at": null,
> > "process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,
> > "started": true, "draining": true, "stopped": false, "replace": false,
> > "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125,
> > "drain_started_at": "2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,
> > "drain_done_at": null, "process_started_at": "2024-08-19T07:21:27.794688Z"}]
> >
> > Here you see the original_weight which the orchestrator apparently failed to read. (Note that these are only small 20 GB OSDs, hence the small weight.) You probably didn't have the output while the OSDs were scheduled for draining, correct? I was able to break my cephadm module by injecting that json again (it was already completed, hence empty), but maybe I did it incorrectly, not sure yet.
> >
> > Regards,
> > Eugen
> >
> > Zitat von Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
> >
> >> So about a week and a half ago, I started a drain on an incorrect host. I fairly quickly realized that it was the wrong host, so I stopped the drain, canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then dumped and edited the crush map to properly reweight those osds and host, and applied the edited crush map. I then proceeded with a full drain of the correct host and completed that before attempting to upgrade my cluster.
> >>
> >> I started the upgrade, and all 3 of my managers were upgraded from 18.2.2 to 18.2.4. At this point, my managers started back up, but with an orchestrator that had failed to start, so the upgrade was unable to continue. My cluster is in a state where only the 3 managers are upgraded to 18.2.4 and everything else is at 18.2.2.
> >>
> >> Since my orchestrator is not able to start, I'm unfortunately not able to run any ceph orch commands, as I receive "Error ENOENT: Module not found" because the cephadm module doesn't load.
> >>
> >> Output of ceph versions:
> >>
> >> {
> >>     "mon": {
> >>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 5
> >>     },
> >>     "mgr": {
> >>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 1
> >>     },
> >>     "osd": {
> >>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 119
> >>     },
> >>     "mds": {
> >>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 4
> >>     },
> >>     "overall": {
> >>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 129
> >>     }
> >> }
> >>
> >> I mentioned in my previous post that I tried manually downgrading the managers to 18.2.2 because I thought there might be an issue with 18.2.4, but 18.2.2 also has the PR that I believe is causing this (https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d), so no luck.
> >>
> >> Thanks!
> >> (so sorry, I did not reply all, so you may have received this twice)
> >>
> >> On Sat, Aug 17, 2024 at 2:55 AM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >>> Just to get some background information, did you remove OSDs while performing the upgrade? Or did you start OSD removal and then start the upgrade? Upgrades should be started with a healthy cluster, but one can't guarantee that, of course; OSDs and/or entire hosts can obviously also fail during an upgrade.
> >>> Just trying to understand what could cause this (I haven't upgraded production clusters to Reef yet, only test clusters). Have you stopped the upgrade to cancel the process entirely? Can you share this information please:
> >>>
> >>> ceph versions
> >>> ceph orch upgrade status
> >>>
> >>> Zitat von Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
> >>>
> >>>> Just wanted to follow up on this; I am unfortunately still stuck with this and can't find where the json for this value is stored. I'm wondering if I should attempt to build a manager container with the code for this reverted to before the commit that introduced the original_weight argument.
> >>>> Please let me know if you guys have any thoughts.
> >>>>
> >>>> Thank you!
> >>>>
> >>>> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth <benjaminmhuth@xxxxxxxxx> wrote:
> >>>>
> >>>>> Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have encountered a problem with my managers. After they had been upgraded, my ceph orch module broke because the cephadm module would not load. This obviously halted the update, because you can't really update without the orchestrator.
> >>>>> Here are the logs related to why the cephadm module fails to start:
> >>>>>
> >>>>> https://pastebin.com/SzHbEDVA
> >>>>>
> >>>>> and the relevant part here:
> >>>>>
> >>>>> "backtrace": [
> >>>>>     "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 591, in __init__\n    self.to_remove_osds.load_from_store()",
> >>>>>     "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 918, in load_from_store\n    osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
> >>>>>     "  File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 783, in from_json\n    return cls(**inp)",
> >>>>>     "TypeError: __init__() got an unexpected keyword argument 'original_weight'"
> >>>>> ]
> >>>>>
> >>>>> Unfortunately, I am at a loss as to what passes this the original_weight argument. I have attempted to migrate back to 18.2.2 and successfully redeployed a manager of that version, but it also has the same issue with the cephadm module. I believe this may be because I recently started several OSD drains, then canceled them, causing this to manifest once the managers restarted.
> >>>>>
> >>>>> I went through a good bit of the source and found the module at fault:
> >>>>>
> >>>>> https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779
> >>>>>
> >>>>> as well as the PR that caused the issue:
> >>>>>
> >>>>> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
> >>>>>
> >>>>> I unfortunately am not familiar enough with the ceph source to find the ceph-config values I need to delete, or smart enough to fix this myself. Any help would be super appreciated.
> >>>>>
> >>>>> Thanks!
> >>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx