Re: cephadm trouble

Hi!

Adam, big thanks!

"ceph config rm osd.91 container_image" completly solve this trouble.
I don't understand why this happened, but at least now everything works. 
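
For anyone who hits the same symptom, a minimal sketch of the cleanup, using only the commands mentioned in this thread (the osd.91 daemon name is specific to this cluster; adjust it, and note the "ceph mgr fail" step may not be strictly required):

    # find any stray per-daemon container_image overrides
    ceph config dump | grep container_image

    # remove the bogus override (here it was set on osd.91)
    ceph config rm osd.91 container_image

    # restart the active mgr so cephadm stops retrying the bad pull
    ceph mgr fail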

Thank you so much again!


----- Original Message -----
> From: "Fyodor Ustinov" <ufm@xxxxxx>
> To: "Adam King" <adking@xxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 1 February, 2022 18:12:16
> Subject:  Re: cephadm trouble

> Hi!
> YES! HERE IT IS!
> 
> global             basic     container_image   quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728   *
>    osd.91           basic     container_image   s-8-2-1:/dev/bcache0
> 
> Two questions:
> 1. How did it get there?
> 2. How do I delete it? As far as I understand, this field is not editable.
> 
> 
> ----- Original Message -----
>> From: "Adam King" <adking@xxxxxxxxxx>
>> To: "Fyodor Ustinov" <ufm@xxxxxx>
>> Cc: "ceph-users" <ceph-users@xxxxxxx>
>> Sent: Tuesday, 1 February, 2022 17:45:13
>> Subject: Re:  Re: cephadm trouble
> 
>> As a follow-up to my previous comment, could you also post the output of
>> "ceph config dump | grep container_image"? It's related to the repo digest
>> thing, and it's another way we might discover where "s-8-2-1:/dev/bcache0"
>> is set as an image.
>> 
>> - Adam King
>> 
>> On Tue, Feb 1, 2022 at 8:52 AM Adam King <adking@xxxxxxxxxx> wrote:
>> 
>>> Hi Fyodor,
>>>
>>> Honestly, I'm quite confused by your case. "Daemon add osd" is meant to
>>> be a one-time synchronous command, so the idea that it is causing this
>>> repeated pull is very odd. I think I would need some sort of list of the
>>> commands run on this cluster, or some kind of reproducer. As mentioned
>>> before, cephadm definitely thinks "s-8-2-1:/dev/bcache0" is the name of a
>>> container image, but I can't think of where that is set: I didn't see it
>>> in any of the posted service specs or in the config options for any of
>>> the images, yet it clearly must be set somewhere or we wouldn't be trying
>>> to pull it repeatedly. I've never seen an issue like this before. This is
>>> a total long shot, but you could try setting "ceph config set mgr
>>> mgr/cephadm/use_repo_digest false" and see if it at least lets you
>>> refresh the daemons and make progress (or at least gets us different
>>> things in the logs).
>>>
>>> Sorry for not being too helpful,
>>>
>>> - Adam King
>>>
>>> On Tue, Feb 1, 2022 at 3:27 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>>
>>>> Hi!
>>>>
>>>> No more ideas? :(
>>>>
>>>>
>>>> ----- Original Message -----
>>>> > From: "Fyodor Ustinov" <ufm@xxxxxx>
>>>> > To: "Adam King" <adking@xxxxxxxxxx>
>>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>>> > Sent: Friday, 28 January, 2022 23:02:26
>>>> > Subject:  Re: cephadm trouble
>>>>
>>>> > Hi!
>>>> >
>>>> >> Hmm, I'm not seeing anything that could be a cause in any of that
>>>> >> output. I did notice, however, from your "ceph orch ls" output that
>>>> >> none of your services have been refreshed since the 24th. Cephadm
>>>> >> typically tries to refresh these things every 10 minutes, so that
>>>> >> signals something is quite wrong.
>>>> > From what I see in /var/log/ceph/cephadm.log, it tries to run the same
>>>> > command once a minute and does nothing else. That's why the status has
>>>> > not been updated for 5 days.
>>>> >
>>>> >> Could you try running "ceph mgr fail", and if nothing seems to be
>>>> >> resolved, could you post "ceph log last 200 debug cephadm"? Maybe we
>>>> >> can see if something gets stuck again after the mgr restarts.
>>>> > "ceph mgr fail" did not help.
>>>> > "ceph log last 200 debug cephadm" shows the same thing again and again:
>>>> >
>>>> > 2022-01-28T20:57:12.792090+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 349 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > /usr/bin/podman: stderr Error: invalid reference format
>>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > Traceback (most recent call last):
>>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
>>>> >     yield (conn, connr)
>>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
>>>> >     code, '\n'.join(err)))
>>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > /usr/bin/podman: stderr Error: invalid reference format
>>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > 2022-01-28T20:58:13.092996+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 392 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > /usr/bin/podman: stderr Error: invalid reference format
>>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > Traceback (most recent call last):
>>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
>>>> >     yield (conn, connr)
>>>> >   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
>>>> >     code, '\n'.join(err)))
>>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> > /usr/bin/podman: stderr Error: invalid reference format
>>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> >
>>>> >>
>>>> >> Thanks,
>>>> >>
>>>> >> - Adam King
>>>> >>
>>>> >> On Thu, Jan 27, 2022 at 7:06 PM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>>> >>
>>>> >>> Hi!
>>>> >>>
>>>> >>> I think this happened after I tried to recreate the osd with the
>>>> >>> command "ceph orch daemon add osd s-8-2-1:/dev/bcache0"
>>>> >>>
>>>> >>>
>>>> >>> > It looks like cephadm believes "s-8-2-1:/dev/bcache0" is a container
>>>> >>> > image for some daemon. Can you provide the output of "ceph orch ls
>>>> >>> > --format yaml",
>>>> >>>
>>>> >>> https://pastebin.com/CStBf4J0
>>>> >>>
>>>> >>> > "ceph orch upgrade status",
>>>> >>> root@s-26-9-19-mon-m1:~# ceph orch upgrade status
>>>> >>> {
>>>> >>>     "target_image": null,
>>>> >>>     "in_progress": false,
>>>> >>>     "services_complete": [],
>>>> >>>     "progress": null,
>>>> >>>     "message": ""
>>>> >>> }
>>>> >>>
>>>> >>>
>>>> >>> > "ceph config get mgr container_image",
>>>> >>> root@s-26-9-19-mon-m1:~# ceph config get mgr container_image
>>>> >>>
>>>> >>>
>>>> quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728
>>>> >>>
>>>> >>>
>>>> >>> > and the values for monitoring stack container images (format is
>>>> "ceph
>>>> >>> > config get mgr mgr/cephadm/container_image_<daemon-type>" where
>>>> daemon
>>>> >>> type
>>>> >>> > is one of "prometheus", "node_exporter", "alertmanager", "grafana",
>>>> >>> > "haproxy", "keepalived").
>>>> >>> quay.io/prometheus/prometheus:v2.18.1
>>>> >>> quay.io/prometheus/node-exporter:v0.18.1
>>>> >>> quay.io/prometheus/alertmanager:v0.20.0
>>>> >>> quay.io/ceph/ceph-grafana:6.7.4
>>>> >>> docker.io/library/haproxy:2.3
>>>> >>> docker.io/arcts/keepalived
>>>> >>>
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> >
>>>> >>> > - Adam King
>>>> >>>
>>>> >>> Thanks a lot!
>>>> >>>
>>>> >>> WBR,
>>>> >>>     Fyodor.
>>>> >>>
>>>> >>> >
>>>> >>> > On Thu, Jan 27, 2022 at 9:10 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>>> >>> >
>>>> >>> >> Hi!
>>>> >>> >>
>>>> >>> >> I rebooted the nodes with mgr and now I see the following in the
>>>> >>> >> cephadm.log:
>>>> >>> >>
>>>> >>> >> As I understand it, cephadm is trying to execute some unsuccessful
>>>> >>> >> command of mine (I wonder which one); it does not succeed, but it
>>>> >>> >> keeps trying and trying. How do I stop it from trying?
>>>> >>> >>
>>>> >>> >> 2022-01-27 16:02:58,123 7fca7beca740 DEBUG
>>>> >>> >> --------------------------------------------------------------------------------
>>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
>>>> >>> >> 2022-01-27 16:02:58,147 7fca7beca740 DEBUG /usr/bin/podman: 3.3.1
>>>> >>> >> 2022-01-27 16:02:58,249 7fca7beca740 INFO Pulling container image s-8-2-1:/dev/bcache0...
>>>> >>> >> 2022-01-27 16:02:58,278 7fca7beca740 DEBUG /usr/bin/podman: Error: invalid reference format
>>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO /usr/bin/podman: stderr Error: invalid reference format
>>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> >>> >> 2022-01-27 16:03:58,420 7f897a7a6740 DEBUG
>>>> >>> >> --------------------------------------------------------------------------------
>>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
>>>> >>> >> 2022-01-27 16:03:58,443 7f897a7a6740 DEBUG /usr/bin/podman: 3.3.1
>>>> >>> >> 2022-01-27 16:03:58,547 7f897a7a6740 INFO Pulling container image s-8-2-1:/dev/bcache0...
>>>> >>> >> 2022-01-27 16:03:58,575 7f897a7a6740 DEBUG /usr/bin/podman: Error: invalid reference format
>>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO /usr/bin/podman: stderr Error: invalid reference format
>>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>>> >>> >>
>>>> >>> >> WBR,
>>>> >>> >>     Fyodor.
>>>> >>> >>
>>>> >>>
>>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


