Hi Fyodor,

Honestly, I'm super confused by your case. "ceph orch daemon add osd" is meant
to be a one-time, synchronous command, so the idea that it is causing this
repeated pull in this fashion is very odd. I think I would need some sort of
list of the commands run on this cluster, or some type of reproducer.

As mentioned before, cephadm definitely thinks "s-8-2-1:/dev/bcache0" is the
name of a container image, but I can't think of where that is set: I didn't see
it in any of the posted service specs or in the config options for any of the
images, yet it clearly must be stored somewhere or we wouldn't be trying to
pull it repeatedly. I've never seen an issue like this before.

This is a total long shot, but you could try setting
"ceph config set mgr mgr/cephadm/use_repo_digest false" and see if it at least
lets the daemons refresh and make progress (or at least gets us different
output in the logs).
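If you want to hunt for where that string is hiding, something along these
lines might narrow it down. This is just a rough, untested sketch; it assumes
the bogus value is sitting either in the config database or in cephadm's
config-key store, which is only a guess on my part:

  # Look for any config option carrying the bogus "image" name
  ceph config dump | grep -i bcache0

  # Look for it in anything cephadm has cached in the config-key store
  ceph config-key dump | grep -i bcache0

  # If nothing turns up, try the workaround above and restart the mgr
  ceph config set mgr mgr/cephadm/use_repo_digest false
  ceph mgr fail

If either grep matches, the surrounding option/key name would tell us where
cephadm picked that value up from, so please post it if you find anything.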
Sorry for not being too helpful,

- Adam King

On Tue, Feb 1, 2022 at 3:27 AM Fyodor Ustinov <ufm@xxxxxx> wrote:

> Hi!
>
> No more ideas? :(
>
> ----- Original Message -----
> > From: "Fyodor Ustinov" <ufm@xxxxxx>
> > To: "Adam King" <adking@xxxxxxxxxx>
> > Cc: "ceph-users" <ceph-users@xxxxxxx>
> > Sent: Friday, 28 January, 2022 23:02:26
> > Subject: Re: cephadm trouble
>
> > Hi!
> >
> >> Hmm, I'm not seeing anything that could be a cause in any of that output.
> >> I did notice, however, from your "ceph orch ls" output that none of your
> >> services have been refreshed since the 24th. Cephadm typically tries to
> >> refresh these things every 10 minutes, so that signals something is quite
> >> wrong.
> >
> > From what I see in /var/log/ceph/cephadm.log it tries to run the same
> > command once a minute and does nothing else. That's why the status has
> > not been updated for 5 days.
> >
> >> Could you try running "ceph mgr fail" and if nothing seems to be resolved
> >> could you post "ceph log last 200 debug cephadm". Maybe we can see if
> >> something gets stuck again after the mgr restarts.
> >
> > "ceph mgr fail" did not help.
> > "ceph log last 200 debug cephadm" shows again and again and again:
> >
> > 2022-01-28T20:57:12.792090+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 349
> > : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container
> > image s-8-2-1:/dev/bcache0...
> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > /usr/bin/podman: stderr Error: invalid reference format
> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > Traceback (most recent call last):
> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> >     yield (conn, connr)
> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> >     code, '\n'.join(err)))
> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1,
> > stderr:Pulling container image s-8-2-1:/dev/bcache0...
> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > /usr/bin/podman: stderr Error: invalid reference format
> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > 2022-01-28T20:58:13.092996+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 392
> > : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container
> > image s-8-2-1:/dev/bcache0...
> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > /usr/bin/podman: stderr Error: invalid reference format
> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > Traceback (most recent call last):
> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
> >     yield (conn, connr)
> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
> >     code, '\n'.join(err)))
> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1,
> > stderr:Pulling container image s-8-2-1:/dev/bcache0...
> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
> > /usr/bin/podman: stderr Error: invalid reference format
> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >
> >> Thanks,
> >>
> >> - Adam King
> >>
> >> On Thu, Jan 27, 2022 at 7:06 PM Fyodor Ustinov <ufm@xxxxxx> wrote:
> >>
> >>> Hi!
> >>>
> >>> I think this happened after I tried to recreate the osd with the command
> >>> "ceph orch daemon add osd s-8-2-1:/dev/bcache0"
> >>>
> >>> > It looks like cephadm believes "s-8-2-1:/dev/bcache0" is a container image
> >>> > for some daemon. Can you provide the output of "ceph orch ls --format
> >>> > yaml",
> >>>
> >>> https://pastebin.com/CStBf4J0
> >>>
> >>> > "ceph orch upgrade status",
> >>> root@s-26-9-19-mon-m1:~# ceph orch upgrade status
> >>> {
> >>>     "target_image": null,
> >>>     "in_progress": false,
> >>>     "services_complete": [],
> >>>     "progress": null,
> >>>     "message": ""
> >>> }
> >>>
> >>> > "ceph config get mgr container_image",
> >>> root@s-26-9-19-mon-m1:~# ceph config get mgr container_image
> >>> quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728
> >>>
> >>> > and the values for monitoring stack container images (format is "ceph
> >>> > config get mgr mgr/cephadm/container_image_<daemon-type>" where daemon
> >>> > type is one of "prometheus", "node_exporter", "alertmanager", "grafana",
> >>> > "haproxy", "keepalived").
> >>> quay.io/prometheus/prometheus:v2.18.1
> >>> quay.io/prometheus/node-exporter:v0.18.1
> >>> quay.io/prometheus/alertmanager:v0.20.0
> >>> quay.io/ceph/ceph-grafana:6.7.4
> >>> docker.io/library/haproxy:2.3
> >>> docker.io/arcts/keepalived
> >>>
> >>> > Thanks,
> >>> >
> >>> > - Adam King
> >>>
> >>> Thanks a lot!
> >>>
> >>> WBR,
> >>> Fyodor.
> >>>
> >>> > On Thu, Jan 27, 2022 at 9:10 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
> >>> >
> >>> >> Hi!
> >>> >>
> >>> >> I rebooted the nodes with mgr and now I see the following in the
> >>> >> cephadm.log:
> >>> >>
> >>> >> As I understand it, cephadm is trying to execute some unsuccessful
> >>> >> command of mine (I wonder which one); it does not succeed, but it
> >>> >> keeps trying and trying. How do I stop it from trying?
> >>> >>
> >>> >> 2022-01-27 16:02:58,123 7fca7beca740 DEBUG
> >>> >> --------------------------------------------------------------------------------
> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
> >>> >> 2022-01-27 16:02:58,147 7fca7beca740 DEBUG /usr/bin/podman: 3.3.1
> >>> >> 2022-01-27 16:02:58,249 7fca7beca740 INFO Pulling container image
> >>> >> s-8-2-1:/dev/bcache0...
> >>> >> 2022-01-27 16:02:58,278 7fca7beca740 DEBUG /usr/bin/podman: Error: invalid
> >>> >> reference format
> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO Non-zero exit code 125 from
> >>> >> /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO /usr/bin/podman: stderr Error:
> >>> >> invalid reference format
> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 ERROR ERROR: Failed command:
> >>> >> /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>> >> 2022-01-27 16:03:58,420 7f897a7a6740 DEBUG
> >>> >> --------------------------------------------------------------------------------
> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
> >>> >> 2022-01-27 16:03:58,443 7f897a7a6740 DEBUG /usr/bin/podman: 3.3.1
> >>> >> 2022-01-27 16:03:58,547 7f897a7a6740 INFO Pulling container image
> >>> >> s-8-2-1:/dev/bcache0...
> >>> >> 2022-01-27 16:03:58,575 7f897a7a6740 DEBUG /usr/bin/podman: Error: invalid
> >>> >> reference format
> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO Non-zero exit code 125 from
> >>> >> /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO /usr/bin/podman: stderr Error:
> >>> >> invalid reference format
> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 ERROR ERROR: Failed command:
> >>> >> /usr/bin/podman pull s-8-2-1:/dev/bcache0
> >>> >>
> >>> >> WBR,
> >>> >> Fyodor.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx