Re: cephadm trouble

Glad it's working. Honestly, I have no idea how that happened; I've never
seen it before. Let me know if you ever find out which command caused it.

- Adam King

On Tue, Feb 1, 2022 at 11:29 AM Fyodor Ustinov <ufm@xxxxxx> wrote:

> Hi!
>
> Adam! Big thanx!
>
> "ceph config rm osd.91 container_image" completly solve this trouble.
> I don't understand why this happened, but at least now everything works.
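>
> For the archives, the whole diagnose-and-fix sequence was roughly:
>
>     # list every place a container_image value is set
>     ceph config dump | grep container_image
>     # remove the bogus per-daemon override so osd.91 falls back
>     # to the global container_image
>     ceph config rm osd.91 container_image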
>
> Thank you so much again!
>
>
> ----- Original Message -----
> > From: "Fyodor Ustinov" <ufm@xxxxxx>
> > To: "Adam King" <adking@xxxxxxxxxx>
> > Cc: "ceph-users" <ceph-users@xxxxxxx>
> > Sent: Tuesday, 1 February, 2022 18:12:16
> > Subject:  Re: cephadm trouble
>
> > Hi!
> > YES! HERE IT IS!
> >
> > global    basic    container_image    quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728    *
> > osd.91    basic    container_image    s-8-2-1:/dev/bcache0
> >
> > Two questions:
> > 1. How did it get there?
> > 2. How do I delete it? As far as I understand, this field is not editable.
> >
> >
> > ----- Original Message -----
> >> From: "Adam King" <adking@xxxxxxxxxx>
> >> To: "Fyodor Ustinov" <ufm@xxxxxx>
> >> Cc: "ceph-users" <ceph-users@xxxxxxx>
> >> Sent: Tuesday, 1 February, 2022 17:45:13
> >> Subject: Re:  Re: cephadm trouble
> >
> >> As a follow-up to my previous comment, could you also post "ceph config
> >> dump | grep container_image"? It's related to the repo digest thing, and
> >> it's another way we might discover where "s-8-2-1:/dev/bcache0" is set
> >> as an image.
> >>
> >> - Adam King
> >>
> >> On Tue, Feb 1, 2022 at 8:52 AM Adam King <adking@xxxxxxxxxx> wrote:
> >>
> >>> Hi Fyodor,
> >>>
> >>> Honestly, I'm quite confused by your case. "daemon add osd" is meant to
> >>> be a one-time synchronous command, so the idea that it is causing this
> >>> repeated pull is very odd. I think I would need some sort of list of the
> >>> commands run on this cluster, or some kind of reproducer. As mentioned
> >>> before, cephadm clearly thinks "s-8-2-1:/dev/bcache0" is the name of a
> >>> container image, but I can't think of where that is set: I didn't see it
> >>> in any of the posted service specs or in the image config options, yet
> >>> it must be set somewhere or cephadm wouldn't keep trying to pull it.
> >>> I've never seen an issue like this before. This is a total long shot,
> >>> but you could try setting "ceph config set mgr
> >>> mgr/cephadm/use_repo_digest false" and see if it at least lets you
> >>> refresh the daemons and make progress (or at least gets us different
> >>> output in the logs).
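> >>>
> >>> For anyone reading this later: both directions are plain "ceph config"
> >>> calls, so a rough sketch of trying this and undoing it afterwards is:
> >>>
> >>>     # long shot: stop cephadm from resolving image names to repo digests
> >>>     ceph config set mgr mgr/cephadm/use_repo_digest false
> >>>     # once done debugging, drop the override to go back to the default
> >>>     ceph config rm mgr mgr/cephadm/use_repo_digest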
> >>>
> >>> Sorry for not being too helpful,
> >>>
> >>> - Adam King
> >>>
> >>> On Tue, Feb 1, 2022 at 3:27 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> No more ideas? :(
> >>>>
> >>>>
> >>>> ----- Original Message -----
> >>>> > From: "Fyodor Ustinov" <ufm@xxxxxx>
> >>>> > To: "Adam King" <adking@xxxxxxxxxx>
> >>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
> >>>> > Sent: Friday, 28 January, 2022 23:02:26
> >>>> > Subject:  Re: cephadm trouble
> >>>>
> >>>> > Hi!
> >>>> >
> >>>> >> Hmm, I'm not seeing anything that could be a cause in any of that
> >>>> >> output. I did notice, however, from your "ceph orch ls" output that
> >>>> >> none of your services have been refreshed since the 24th. Cephadm
> >>>> >> typically tries to refresh these things every 10 minutes, so that
> >>>> >> signals something is quite wrong.
> >>>> > From what I see in /var/log/ceph/cephadm.log, it tries to run the same
> >>>> > command once a minute and does nothing else. That's why the status has
> >>>> > not been updated for 5 days.
> >>>> >
> >>>> >> Could you try running "ceph mgr fail"? If nothing seems to be
> >>>> >> resolved, could you post "ceph log last 200 debug cephadm"? Maybe we
> >>>> >> can see if something gets stuck again after the mgr restarts.
> >>>> > "ceph mgr fail" did not help.
> >>>> > "ceph log last 200 debug cephadm" shows the same thing again and again:
> >>>> >
> >>>> > 2022-01-28T20:57:12.792090+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 349 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
> >>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > /usr/bin/podman: stderr Error: invalid reference format
> >>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > Traceback (most recent call last):
> >>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> >>>> >     yield (conn, connr)
> >>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> >>>> >     code, '\n'.join(err)))
> >>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
> >>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > /usr/bin/podman: stderr Error: invalid reference format
> >>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > 2022-01-28T20:58:13.092996+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 392 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
> >>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > /usr/bin/podman: stderr Error: invalid reference format
> >>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > Traceback (most recent call last):
> >>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> >>>> >     yield (conn, connr)
> >>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> >>>> >     code, '\n'.join(err)))
> >>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
> >>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> > /usr/bin/podman: stderr Error: invalid reference format
> >>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> >
> >>>> >>
> >>>> >> Thanks,
> >>>> >>
> >>>> >> - Adam King
> >>>> >>
> >>>> >> On Thu, Jan 27, 2022 at 7:06 PM Fyodor Ustinov <ufm@xxxxxx> wrote:
> >>>> >>
> >>>> >>> Hi!
> >>>> >>>
> >>>> >>> I think this happened after I tried to recreate the OSD with the
> >>>> >>> command "ceph orch daemon add osd s-8-2-1:/dev/bcache0".
> >>>> >>>
> >>>> >>>
> >>>> >>> > It looks like cephadm believes "s-8-2-1:/dev/bcache0" is a
> >>>> >>> > container image for some daemon. Can you provide the output of
> >>>> >>> > "ceph orch ls --format yaml",
> >>>> >>>
> >>>> >>> https://pastebin.com/CStBf4J0
> >>>> >>>
> >>>> >>> > "ceph orch upgrade status",
> >>>> >>> root@s-26-9-19-mon-m1:~# ceph orch upgrade status
> >>>> >>> {
> >>>> >>>     "target_image": null,
> >>>> >>>     "in_progress": false,
> >>>> >>>     "services_complete": [],
> >>>> >>>     "progress": null,
> >>>> >>>     "message": ""
> >>>> >>> }
> >>>> >>>
> >>>> >>>
> >>>> >>> > "ceph config get mgr container_image",
> >>>> >>> root@s-26-9-19-mon-m1:~# ceph config get mgr container_image
> >>>> >>>
> >>>> >>>
> >>>>
> quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728
> >>>> >>>
> >>>> >>>
> >>>> >>> > and the values for the monitoring stack container images (the
> >>>> >>> > format is "ceph config get mgr
> >>>> >>> > mgr/cephadm/container_image_<daemon-type>", where daemon type is
> >>>> >>> > one of "prometheus", "node_exporter", "alertmanager", "grafana",
> >>>> >>> > "haproxy", "keepalived").
> >>>> >>> quay.io/prometheus/prometheus:v2.18.1
> >>>> >>> quay.io/prometheus/node-exporter:v0.18.1
> >>>> >>> quay.io/prometheus/alertmanager:v0.20.0
> >>>> >>> quay.io/ceph/ceph-grafana:6.7.4
> >>>> >>> docker.io/library/haproxy:2.3
> >>>> >>> docker.io/arcts/keepalived
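> >>>> >>>
> >>>> >>> All six can be fetched in one go with a small shell loop over the
> >>>> >>> key names listed above, e.g.:
> >>>> >>>
> >>>> >>>     for d in prometheus node_exporter alertmanager grafana haproxy keepalived; do
> >>>> >>>         ceph config get mgr mgr/cephadm/container_image_$d
> >>>> >>>     done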
> >>>> >>>
> >>>> >>> >
> >>>> >>> > Thanks,
> >>>> >>> >
> >>>> >>> > - Adam King
> >>>> >>>
> >>>> >>> Thanks a lot!
> >>>> >>>
> >>>> >>> WBR,
> >>>> >>>     Fyodor.
> >>>> >>>
> >>>> >>> >
> >>>> >>> > On Thu, Jan 27, 2022 at 9:10 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
> >>>> >>> >
> >>>> >>> >> Hi!
> >>>> >>> >>
> >>>> >>> >> I rebooted the nodes running the mgr, and now I see the following
> >>>> >>> >> in cephadm.log:
> >>>> >>> >>
> >>>> >>> >> As I understand it, cephadm is trying to execute some
> >>>> >>> >> unsuccessful command of mine (I wonder which one); it does not
> >>>> >>> >> succeed, but it keeps trying and trying. How do I stop it from
> >>>> >>> >> trying?
> >>>> >>> >>
> >>>> >>> >> 2022-01-27 16:02:58,123 7fca7beca740 DEBUG --------------------------------------------------------------------------------
> >>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
> >>>> >>> >> 2022-01-27 16:02:58,147 7fca7beca740 DEBUG /usr/bin/podman: 3.3.1
> >>>> >>> >> 2022-01-27 16:02:58,249 7fca7beca740 INFO Pulling container image s-8-2-1:/dev/bcache0...
> >>>> >>> >> 2022-01-27 16:02:58,278 7fca7beca740 DEBUG /usr/bin/podman: Error: invalid reference format
> >>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO /usr/bin/podman: stderr Error: invalid reference format
> >>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> >>> >> 2022-01-27 16:03:58,420 7f897a7a6740 DEBUG --------------------------------------------------------------------------------
> >>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
> >>>> >>> >> 2022-01-27 16:03:58,443 7f897a7a6740 DEBUG /usr/bin/podman: 3.3.1
> >>>> >>> >> 2022-01-27 16:03:58,547 7f897a7a6740 INFO Pulling container image s-8-2-1:/dev/bcache0...
> >>>> >>> >> 2022-01-27 16:03:58,575 7f897a7a6740 DEBUG /usr/bin/podman: Error: invalid reference format
> >>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO /usr/bin/podman: stderr Error: invalid reference format
> >>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>>> >>> >>
> >>>> >>> >> WBR,
> >>>> >>> >>     Fyodor.
> >>>> >>> >>
> >>>> >>>
> >>>>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


