Hi,
what is the output of this command?
ceph config-key get mgr/cephadm/osd_remove_queue
I just tried to cancel a drain on a small 18.2.4 test cluster; it went
well in my case, though. After scheduling the drain, the mentioned key
looks like this:
# ceph config-key get mgr/cephadm/osd_remove_queue
[{"osd_id": 1, "started": true, "draining": false, "stopped": false,
  "replace": false, "force": false, "zap": false, "hostname": "host5",
  "original_weight": 0.0233917236328125, "drain_started_at": null,
  "drain_stopped_at": null, "drain_done_at": null,
  "process_started_at": "2024-08-19T07:21:27.783527Z"},
 {"osd_id": 13, "started": true, "draining": true, "stopped": false,
  "replace": false, "force": false, "zap": false, "hostname": "host5",
  "original_weight": 0.0233917236328125,
  "drain_started_at": "2024-08-19T07:21:30.365237Z",
  "drain_stopped_at": null, "drain_done_at": null,
  "process_started_at": "2024-08-19T07:21:27.794688Z"}]
Here you can see the original_weight field which the orchestrator in
your case apparently failed to handle. (Note that these are only small
20 GB OSDs, hence the small weight.) You probably don't have this
output from the time the OSDs were scheduled for draining, correct? I
was able to break my own cephadm module by injecting that JSON again
after the drain had already completed (so the key was empty at that
point), but maybe I did it incorrectly, I'm not sure yet.
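If the osd_remove_queue on your cluster still contains entries with
original_weight, one thing I would try (treat it as a rough sketch, I
haven't verified it on a broken cluster, and definitely keep a backup)
is to sanitize or clear that key and then fail over the mgr so the
cephadm module re-reads it on startup. The second variant assumes jq
is available on your admin node:

# back up the current queue before touching anything
ceph config-key get mgr/cephadm/osd_remove_queue > osd_remove_queue.backup.json

# variant 1: clear the queue completely
ceph config-key set mgr/cephadm/osd_remove_queue '[]'

# variant 2: only strip the field the loaded code doesn't know about
jq -c 'map(del(.original_weight))' osd_remove_queue.backup.json > osd_remove_queue.fixed.json
ceph config-key set mgr/cephadm/osd_remove_queue "$(cat osd_remove_queue.fixed.json)"

# let a standby mgr take over so cephadm reloads the key
ceph mgr fail

Note that clearing the queue makes cephadm forget any pending OSD
removals, so you would have to schedule those again afterwards. The
config-key commands are handled by the MONs, so they should still work
even while the cephadm module refuses to load.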
Regards,
Eugen
Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
So about a week and a half ago, I started a drain on an incorrect host. I
fairly quickly realized that it was the wrong host, so I stopped the drain,
canceled the OSD deletions with "ceph orch osd rm stop OSD_ID", then dumped
the crush map, edited it to properly reweight those OSDs and the host, and
applied the edited map. I then proceeded with a full drain of the correct
host and completed that before attempting to upgrade my cluster.
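(For reference, the dump/edit/apply cycle was roughly the standard one;
the file names below are just placeholders:)

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt and adjust the weights of the affected OSDs and host
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin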
I started the upgrade, and all 3 of my managers were upgraded from 18.2.2
to 18.2.4. At that point the managers started back up, but the orchestrator
failed to start, so the upgrade could not continue. My cluster is now in a
state where only the 3 managers are upgraded to 18.2.4 and every other
daemon is still at 18.2.2.
Since the orchestrator is not able to start, I unfortunately can't run any
ceph orch commands; they fail with "Error ENOENT: Module not found" because
the cephadm module doesn't load.
Output of ceph versions:
{
    "mon": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 5
    },
    "mgr": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 1
    },
    "osd": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 119
    },
    "mds": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 4
    },
    "overall": {
        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)": 129
    }
}
I mentioned in my previous post that I tried manually downgrading the
managers to 18.2.2 because I thought there might be an issue specific to
18.2.4, but 18.2.2 also contains the change that I believe is causing this
(https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d),
so no luck.
Thanks!
(so sorry, I did not reply all so you may have received this twice)
On Sat, Aug 17, 2024 at 2:55 AM Eugen Block <eblock@xxxxxx> wrote:
Just to get some background information: did you remove OSDs while
performing the upgrade? Or did you start the OSD removal and then start
the upgrade? Upgrades should be started on a healthy cluster, but of
course one can't always guarantee that; OSDs and/or entire hosts can
obviously fail during an upgrade as well.
I'm just trying to understand what could cause this (I haven't upgraded
production clusters to Reef yet, only test clusters). Have you stopped
the upgrade to cancel the process entirely? Can you share this
information, please:
ceph versions
ceph orch upgrade status
Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
> Just wanted to follow up on this, I am unfortunately still stuck with
> this and can't find where the JSON for this value is stored. I'm
> wondering if I should attempt to build a manager container with the
> code for this reverted to before the commit that introduced the
> original_weight argument. Please let me know if you guys have any
> thoughts.
>
> Thank you!
>
> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth <benjaminmhuth@xxxxxxxxx> wrote:
>
>> Hey there, so I went to upgrade my ceph from 18.2.2 to 18.2.4 and have
>> encountered a problem with my managers. After they had been upgraded, my
>> ceph orch module broke because the cephadm module would not load. This
>> obviously halted the update because you can't really update without the
>> orchestrator. Here are the logs related to why the cephadm module
>> fails to start:
>>
>> https://pastebin.com/SzHbEDVA
>>
>> and the relevant part here:
>>
>> "backtrace": [
>>
>> " File \\"/usr/share/ceph/mgr/cephadm/module.py\\", line 591, in
>> __init__\\n self.to_remove_osds.load_from_store()",
>>
>> " File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 918, in
>> load_from_store\\n osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
>>
>> " File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 783, in
>> from_json\\n return cls(**inp)",
>>
>> "TypeError: __init__() got an unexpected keyword argument
>> 'original_weight'"
>>
>> ]
>>
>> Unfortunately, I am at a loss as to what passes the original_weight
>> argument here. I have attempted to migrate back to 18.2.2 and
>> successfully redeployed a manager of that version, but it also has the
>> same issue with the cephadm module. I believe this may be because I
>> recently started several OSD drains, then canceled them, causing this
>> to manifest once the managers restarted.
>>
>> I went through a good bit of the source and found the module at fault:
>>
>> https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779
>>
>> as well as the PR that caused the issue:
>>
>> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
>>
>> I unfortunately am not familiar enough with the ceph source to find the
>> ceph-config values I need to delete or smart enough to fix this myself.
>> Any help would be super appreciated.
>>
>> Thanks!
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx