There's a tracker issue for this:
https://tracker.ceph.com/issues/67329
Quoting Eugen Block <eblock@xxxxxx>:
> Hi,
>
> what is the output of this command?
>
> ceph config-key get mgr/cephadm/osd_remove_queue
>
> I just tried to cancel a drain on a small 18.2.4 test cluster, and it
> went well, though. After scheduling the drain, the mentioned key
> looks like this:
>
> # ceph config-key get mgr/cephadm/osd_remove_queue
> [{"osd_id": 1, "started": true, "draining": false, "stopped": false,
> "replace": false, "force": false, "zap": false, "hostname": "host5",
> "original_weight": 0.0233917236328125, "drain_started_at": null,
> "drain_stopped_at": null, "drain_done_at": null,
> "process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,
> "started": true, "draining": true, "stopped": false, "replace":
> false, "force": false, "zap": false, "hostname": "host5",
> "original_weight": 0.0233917236328125, "drain_started_at":
> "2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,
> "drain_done_at": null, "process_started_at":
> "2024-08-19T07:21:27.794688Z"}]
>
> Here you can see the original_weight, which the orchestrator apparently
> failed to read. (Note that these are only small 20 GB OSDs, hence
> the small weight.) You probably didn't capture this output while the
> OSDs were scheduled for draining, correct? I was able to break my
> cephadm module by injecting that JSON again (the removal was already
> completed, hence the key was empty), but maybe I did it incorrectly,
> not sure yet.
>
> Regards,
> Eugen
>
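To check a saved copy of that key for the field in question, a small Python sketch like the following could be used. It assumes the output of `ceph config-key get mgr/cephadm/osd_remove_queue` has already been captured as a string; the helper names `entries_with_key` and `without_key` are made up for illustration. Whether writing a modified value back is safe is not established in this thread (re-injecting the JSON broke the module on the test cluster above), so the sketch only inspects and transforms the text locally.

```python
import json

# Shortened sample in the shape of the output quoted above.
raw = '''[{"osd_id": 1, "started": true, "draining": false,
  "hostname": "host5", "original_weight": 0.0233917236328125},
 {"osd_id": 13, "started": true, "draining": true,
  "hostname": "host5", "original_weight": 0.0233917236328125}]'''


def entries_with_key(queue_json, key):
    """Return the osd_id of every queue entry that carries `key`."""
    return [e["osd_id"] for e in json.loads(queue_json) if key in e]


def without_key(queue_json, key):
    """Return a JSON string of the queue with `key` removed from each entry."""
    entries = json.loads(queue_json)
    cleaned = [{k: v for k, v in e.items() if k != key} for e in entries]
    return json.dumps(cleaned)


affected = entries_with_key(raw, "original_weight")
cleaned = without_key(raw, "original_weight")
print("entries carrying original_weight:", affected)
print(cleaned)
```

The two helpers are kept separate so that the read-only check can be run on its own before deciding anything about the stored value.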
> Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
>
>> So about a week and a half ago, I started a drain on an incorrect host. I
>> fairly quickly realized that it was the wrong host, so I stopped the drain,
>> canceled the osd deletions with "ceph orch osd rm stop OSD_ID", then
>> dumped, edited the crush map to properly reweight those osds and host, and
>> applied the edited crush map. I then proceeded with a full drain of the
>> correct host and completed that before attempting to upgrade my cluster.
>>
>> I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
>> to 18.2.4. At this point, my managers started back up, but with an
>> orchestrator that had failed to start, so the upgrade was unable to
>> continue. My cluster is in a state where only the 3 managers are upgraded
>> to 18.2.4 and every other part is at 18.2.2.
>>
>> Since my orchestrator is not able to start, I'm unfortunately not able to
>> run any ceph orch commands, as I receive "Error ENOENT: Module not found"
>> because the cephadm module doesn't load.
>> Output of ceph versions:
>> {
>>     "mon": {
>>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 5
>>     },
>>     "mgr": {
>>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 1
>>     },
>>     "osd": {
>>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 119
>>     },
>>     "mds": {
>>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 4
>>     },
>>     "overall": {
>>         "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 129
>>     }
>> }
>>
>> I mentioned in my previous post that I tried manually downgrading the
>> managers to 18.2.2 because I thought there may be an issue with 18.2.4, but
>> 18.2.2 also has the PR that I believe is causing this
>> (https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d),
>> so no luck.
>>
>> Thanks!
>> (so sorry, I did not reply all so you may have received this twice)
>>
>> On Sat, Aug 17, 2024 at 2:55 AM Eugen Block <eblock@xxxxxx> wrote:
>>
>>> Just to get some background information, did you remove OSDs while
>>> performing the upgrade? Or did you start OSD removal and then start
>>> the upgrade? Upgrades should be started with a healthy cluster, but
>>> one can’t guarantee that of course, OSDs and/or entire hosts can
>>> obviously also fail during an upgrade.
>>> Just trying to understand what could cause this (I haven’t upgraded
>>> production clusters to Reef yet, only test clusters). Have you stopped
>>> the upgrade to cancel the process entirely? Can you share this
>>> information please:
>>>
>>> ceph versions
>>> ceph orch upgrade status
>>>
>>> Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
>>>
>>>> Just wanted to follow up on this, I am unfortunately still stuck with this
>>>> and can't find where the json for this value is stored. I'm wondering if I
>>>> should attempt to build a manager container with the code for this
>>>> reverted to before the commit that introduced the original_weight argument.
>>>> Please let me know if you guys have any thoughts
>>>>
>>>> Thank you!
>>>>
>>>> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth <benjaminmhuth@xxxxxxxxx> wrote:
>>>>
>>>>> Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
>>>>> encountered a problem with my managers. After they had been upgraded, my
>>>>> ceph orch module broke because the cephadm module would not load. This
>>>>> obviously halted the update because you can't really update without the
>>>>> orchestrator. Here are the logs related to why the cephadm module fails to
>>>>> start:
>>>>>
>>>>> https://pastebin.com/SzHbEDVA
>>>>>
>>>>> and the relevant part here:
>>>>>
>>>>> "backtrace": [
>>>>> " File \"/usr/share/ceph/mgr/cephadm/module.py\", line 591, in __init__\n self.to_remove_osds.load_from_store()",
>>>>> " File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 918, in load_from_store\n osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
>>>>> " File \"/usr/share/ceph/mgr/cephadm/services/osd.py\", line 783, in from_json\n return cls(**inp)",
>>>>> "TypeError: __init__() got an unexpected keyword argument 'original_weight'"
>>>>> ]
>>>>>
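The backtrace above ends in `cls(**inp)` rejecting a key it does not know. A minimal Python sketch of that failure mode follows. `OsdStub` and `from_json_tolerant` are hypothetical stand-ins invented for illustration; they are not the cephadm `OSD` class and not its actual fix. The sketch only shows why a stored dict with an extra key raises `TypeError`, and how filtering against the constructor signature would avoid it, assuming that silently dropping unknown keys is acceptable.

```python
import inspect


class OsdStub:
    """Hypothetical stand-in for a class whose __init__ lacks 'original_weight'."""

    def __init__(self, osd_id, hostname=None, draining=False):
        self.osd_id = osd_id
        self.hostname = hostname
        self.draining = draining

    @classmethod
    def from_json(cls, inp):
        # Same pattern as the last frame of the backtrace: every stored key
        # is passed straight through as a keyword argument.
        return cls(**inp)

    @classmethod
    def from_json_tolerant(cls, inp):
        # Assumption: dropping unknown keys is acceptable. Keep only the
        # keys that __init__ actually declares.
        accepted = set(inspect.signature(cls.__init__).parameters) - {"self"}
        return cls(**{k: v for k, v in inp.items() if k in accepted})


stored = {"osd_id": 13, "hostname": "host5", "draining": True,
          "original_weight": 0.0233917236328125}

try:
    OsdStub.from_json(stored)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

osd = OsdStub.from_json_tolerant(stored)
print(error_message)
print(osd.osd_id, osd.hostname, osd.draining)
```

The strict loader fails as soon as the persisted data contains a field the running code does not declare, which matches the situation described here where the stored queue and the loaded module disagree about `original_weight`.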
>>>>> Unfortunately, I am at a loss as to what passes the original_weight
>>>>> argument. I have attempted to migrate back to 18.2.2 and successfully
>>>>> redeployed a manager of that version, but it also has the same issue with
>>>>> the cephadm module. I believe this may be because I recently started
>>>>> several OSD drains, then canceled them, causing this to manifest once the
>>>>> managers restarted.
>>>>>
>>>>> I went through a good bit of the source and found the module at fault:
>>>>>
>>>>> https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779
>>>>>
>>>>> as well as the PR that caused the issue:
>>>>>
>>>>> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
>>>>>
>>>>> I unfortunately am not familiar enough with the ceph source to find the
>>>>> ceph-config values I need to delete or smart enough to fix this myself. Any
>>>>> help would be super appreciated.
>>>>>
>>>>> Thanks!
>>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx