Re: Bug with Cephadm module osd service preventing orchestrator start

Thank you so much for the help! Thanks to the tracker issue you linked and
the other reporter you replied to with the same problem, I was able to edit
the config-key and get my orchestrator back. Sorry for not checking the
tracker issues as thoroughly as I should have; that's my bad.
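
For anyone who finds this thread later, the fix was roughly the following
(adapted from the tracker issue; adjust to your own cluster, and
double-check the edited JSON before writing it back):

# ceph config-key get mgr/cephadm/osd_remove_queue > queue.json
(remove every "original_weight" key from queue.json, then)
# ceph config-key set mgr/cephadm/osd_remove_queue -i queue.json
# ceph mgr fail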

On Mon, Aug 19, 2024 at 6:12 AM Eugen Block <eblock@xxxxxx> wrote:

> There's a tracker issue for this:
>
> https://tracker.ceph.com/issues/67329
>
> Quoting Eugen Block <eblock@xxxxxx>:
>
> > Hi,
> >
> > what is the output of this command?
> >
> > ceph config-key get mgr/cephadm/osd_remove_queue
> >
> > I just tried to cancel a drain on a small 18.2.4 test cluster, and it
> > went fine, though. After scheduling the drain, the key in question
> > looks like this:
> >
> > # ceph config-key get mgr/cephadm/osd_remove_queue
> > [{"osd_id": 1, "started": true, "draining": false, "stopped": false,
> > "replace": false, "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125, "drain_started_at": null,
> > "drain_stopped_at": null, "drain_done_at": null,
> > "process_started_at": "2024-08-19T07:21:27.783527Z"}, {"osd_id": 13,
> > "started": true, "draining": true, "stopped": false, "replace":
> > false, "force": false, "zap": false, "hostname": "host5",
> > "original_weight": 0.0233917236328125, "drain_started_at":
> > "2024-08-19T07:21:30.365237Z", "drain_stopped_at": null,
> > "drain_done_at": null, "process_started_at":
> > "2024-08-19T07:21:27.794688Z"}]
> >
> > Here you see the original_weight which the orchestrator apparently
> > failed to read. (Note that these are only small 20 GB OSDs, hence the
> > small weight.) You probably didn't capture the output while the OSDs
> > were scheduled for draining, correct? I was able to break my cephadm
> > module by injecting that json again (it was already completed, hence
> > empty), but maybe I did it incorrectly, not sure yet.
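> >
> > (If anyone needs to reset the queue manually: I believe the
> > completed/empty state of that key is simply an empty JSON list, so
> > something like the following should clear it; use with care, and fail
> > over the mgr afterwards:)
> >
> > # ceph config-key set mgr/cephadm/osd_remove_queue '[]'
> > # ceph mgr fail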
> >
> > Regards,
> > Eugen
> >
> > Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
> >
> >> So about a week and a half ago, I started a drain on an incorrect
> >> host. I realized fairly quickly that it was the wrong host, so I
> >> stopped the drain, canceled the OSD deletions with "ceph orch osd rm
> >> stop OSD_ID", then dumped and edited the CRUSH map to properly
> >> reweight those OSDs and the host, and applied the edited CRUSH map. I
> >> then proceeded with a full drain of the correct host and completed it
> >> before attempting to upgrade my cluster.
> >>
> >> I started the upgrade, and all 3 of my managers were upgraded from
> >> 18.2.2 to 18.2.4. At this point, my managers started back up, but
> >> with an orchestrator that had failed to start, so the upgrade was
> >> unable to continue. My cluster is now in a state where only the 3
> >> managers are upgraded to 18.2.4 and every other component is at
> >> 18.2.2.
> >>
> >> Since my orchestrator is not able to start, I'm unfortunately not
> >> able to run any ceph orch commands; I receive "Error ENOENT: Module
> >> not found" because the cephadm module doesn't load.
> >> Output of ceph versions:
> >> {
> >>    "mon": {
> >>        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 5
> >>    },
> >>    "mgr": {
> >>        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 1
> >>    },
> >>    "osd": {
> >>        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 119
> >>    },
> >>    "mds": {
> >>        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 4
> >>    },
> >>    "overall": {
> >>        "ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2)
> >> reef (stable)": 129
> >>    }
> >> }
> >>
> >> I mentioned in my previous post that I tried manually downgrading the
> >> managers to 18.2.2 because I thought there might be an issue with
> >> 18.2.4, but 18.2.2 also contains the commit that I believe is causing
> >> this (
> >> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
> >> ), so no luck.
> >>
> >> Thanks!
> >> (sorry, I did not reply-all, so you may have received this twice)
> >>
> >> On Sat, Aug 17, 2024 at 2:55 AM Eugen Block <eblock@xxxxxx> wrote:
> >>
> >>> Just to get some background information, did you remove OSDs while
> >>> performing the upgrade? Or did you start the OSD removal and then
> >>> start the upgrade? Upgrades should be started with a healthy
> >>> cluster, but of course one can't always guarantee that; OSDs and/or
> >>> entire hosts can obviously fail during an upgrade.
> >>> Just trying to understand what could cause this (I haven’t upgraded
> >>> production clusters to Reef yet, only test clusters). Have you stopped
> >>> the upgrade to cancel the process entirely? Can you share this
> >>> information please:
> >>>
> >>> ceph versions
> >>> ceph orch upgrade status
> >>>
> >>> Quoting Benjamin Huth <benjaminmhuth@xxxxxxxxx>:
> >>>
> >>>> Just wanted to follow up on this: I am unfortunately still stuck
> >>>> and can't find where the JSON for this value is stored. I'm
> >>>> wondering if I should attempt to build a manager container with the
> >>>> code reverted to before the commit that introduced the
> >>>> original_weight argument. Please let me know if you have any
> >>>> thoughts.
> >>>>
> >>>> Thank you!
> >>>>
> >>>> On Wed, Aug 14, 2024, 7:37 PM Benjamin Huth <benjaminmhuth@xxxxxxxxx>
> >>>> wrote:
> >>>>
> >>>>> Hey there, so I went to upgrade my Ceph cluster from 18.2.2 to
> >>>>> 18.2.4 and encountered a problem with my managers. After they had
> >>>>> been upgraded, my ceph orch module broke because the cephadm module
> >>>>> would not load. This obviously halted the upgrade, because you
> >>>>> can't really upgrade without the orchestrator. Here are the logs
> >>>>> showing why the cephadm module fails to start:
> >>>>>
> >>>>> https://pastebin.com/SzHbEDVA
> >>>>>
> >>>>> and the relevant part here:
> >>>>>
> >>>>> "backtrace": [
> >>>>>
> >>>>> " File \\"/usr/share/ceph/mgr/cephadm/module.py\\", line 591, in
> >>>>> __init__\\n self.to_remove_osds.load_from_store()",
> >>>>>
> >>>>> " File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 918, in
> >>>>> load_from_store\\n osd_obj = OSD.from_json(osd, rm_util=self.rm_util)",
> >>>>>
> >>>>> " File \\"/usr/share/ceph/mgr/cephadm/services/osd.py\\", line 783, in
> >>>>> from_json\\n return cls(**inp)",
> >>>>>
> >>>>> "TypeError: __init__() got an unexpected keyword argument
> >>>>> 'original_weight'"
> >>>>>
> >>>>> ]
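> >>>>>
> >>>>> To illustrate what I think is happening (a simplified sketch, not
> >>>>> the actual cephadm code): from_json() feeds every stored JSON key
> >>>>> straight into the constructor, so a key the running version doesn't
> >>>>> know about blows up on module load:
> >>>>>
> >>>>> class OSD:
> >>>>>     # an older signature with no 'original_weight' parameter
> >>>>>     def __init__(self, osd_id, draining=False):
> >>>>>         self.osd_id = osd_id
> >>>>>         self.draining = draining
> >>>>>
> >>>>>     @classmethod
> >>>>>     def from_json(cls, inp):
> >>>>>         return cls(**inp)  # unpacks unknown keys too
> >>>>>
> >>>>> OSD.from_json({"osd_id": 1, "original_weight": 0.0234})
> >>>>> # TypeError: __init__() got an unexpected keyword argument
> >>>>> # 'original_weight'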
> >>>>>
> >>>>> Unfortunately, I am at a loss as to what passes the original_weight
> >>>>> argument here. I have attempted to migrate back to 18.2.2 and
> >>>>> successfully redeployed a manager of that version, but it has the
> >>>>> same issue with the cephadm module. I believe this may be because I
> >>>>> recently started several OSD drains and then canceled them, causing
> >>>>> the problem to manifest once the managers restarted.
> >>>>>
> >>>>> I went through a good bit of the source and found the code at
> >>>>> fault:
> >>>>>
> >>>>> https://github.com/ceph/ceph/blob/e0dd396793b679922e487332a2a4bc48e024a42f/src/pybind/mgr/cephadm/services/osd.py#L779
> >>>>>
> >>>>> as well as the commit that caused the issue:
> >>>>>
> >>>>> https://github.com/ceph/ceph/commit/ba7fac074fb5ad072fcad10862f75c0a26a7591d
> >>>>>
> >>>>> I unfortunately am not familiar enough with the Ceph source to find
> >>>>> the config-key values I need to delete, or smart enough to fix this
> >>>>> myself. Any help would be super appreciated.
> >>>>>
> >>>>> Thanks!
> >>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



