Re: cephadm trouble

Hi!
YES! HERE IT IS!

global             basic     container_image                           quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728  * 
    osd.91         basic     container_image                           s-8-2-1:/dev/bcache0        

Two questions:
1. How did it get there?
2. How do I delete it? As far as I understand, this field is not editable.
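Just guessing here (please correct me if this is wrong for a per-daemon override): since the value looks like an ordinary config override, maybe it can be cleared with "ceph config rm" followed by a mgr restart, e.g.:

    # guess: remove the stray per-daemon container_image override for osd.91
    ceph config rm osd.91 container_image
    # restart the active mgr so the orchestrator stops retrying the pull
    ceph mgr fail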


----- Original Message -----
> From: "Adam King" <adking@xxxxxxxxxx>
> To: "Fyodor Ustinov" <ufm@xxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>
> Sent: Tuesday, 1 February, 2022 17:45:13
> Subject: Re:  Re: cephadm trouble

> As a follow-up to my previous comment, could you also post "ceph config
> dump | grep container_image"? It's related to the repo digest thing, and
> it's another way we could maybe discover where "s-8-2-1:/dev/bcache0" is
> set as an image.
> 
> - Adam King
> 
> On Tue, Feb 1, 2022 at 8:52 AM Adam King <adking@xxxxxxxxxx> wrote:
> 
>> Hi Fyodor,
>>
>> Honestly I'm super confused by your case. Daemon add osd is meant to be a
>> one time synchronous command so the idea that that is causing this repeated
>> pull in this fashion is super odd. I think I would need some sort of list
>> of commands run on this cluster or some type of reproducer. As mentioned
>> before, cephadm definitely thinks "s-8-2-1:/dev/bcache0" is the name of a
>> container image but I can't think of where that is set as I didn't see it
>> in any of the posted service specs or the config options for the any of the
>> images but it clearly must be set somewhere or we wouldn't be trying to
>> pull that repeatedly. Never seen an issue like this before. This is a total
>> long shot, but you could trying setting "ceph config set mgr
>> mgr/cephadm/use_repo_digest false" and see if it at least lets you refresh
>> the daemons and make progress (or at least gets us different things in the
>> logs).
>>
>> Sorry for not being too helpful,
>>
>> - Adam King
>>
>> On Tue, Feb 1, 2022 at 3:27 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>
>>> Hi!
>>>
>>> No more ideas? :(
>>>
>>>
>>> ----- Original Message -----
>>> > From: "Fyodor Ustinov" <ufm@xxxxxx>
>>> > To: "Adam King" <adking@xxxxxxxxxx>
>>> > Cc: "ceph-users" <ceph-users@xxxxxxx>
>>> > Sent: Friday, 28 January, 2022 23:02:26
>>> > Subject:  Re: cephadm trouble
>>>
>>> > Hi!
>>> >
>>> >> Hmm, I'm not seeing anything that could be a cause in any of that
>>> >> output. I did notice, however, from your "ceph orch ls" output that
>>> >> none of your services have been refreshed since the 24th. Cephadm
>>> >> typically tries to refresh these things every 10 minutes so that
>>> >> signals something is quite wrong.
>>> > From what I see in /var/log/ceph/cephadm.log it tries to run the same
>>> > command once a minute and does nothing else. That's why the status has
>>> > not been updated for 5 days.
>>> >
>>> >> Could you try running "ceph mgr fail" and if nothing seems to be
>>> >> resolved could you post "ceph log last 200 debug cephadm". Maybe we
>>> >> can see if something gets stuck again after the mgr restarts.
>>> > "ceph mgr fail" did not help.
>>> > "ceph log last 200 debug cephadm" show again and again and again:
>>> >
>>> > 2022-01-28T20:57:12.792090+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 349 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > /usr/bin/podman: stderr Error: invalid reference format
>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > Traceback (most recent call last):
>>> >  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
>>> >    yield (conn, connr)
>>> >  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
>>> >    code, '\n'.join(err)))
>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > /usr/bin/podman: stderr Error: invalid reference format
>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > 2022-01-28T20:58:13.092996+0000 mgr.s-26-9-24-mon-m2.nhltmq (mgr.129738166) 392 : cephadm [ERR] cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > /usr/bin/podman: stderr Error: invalid reference format
>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > Traceback (most recent call last):
>>> >  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1363, in _remote_connection
>>> >    yield (conn, connr)
>>> >  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1256, in _run_cephadm
>>> >    code, '\n'.join(err)))
>>> > orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr:Pulling container image s-8-2-1:/dev/bcache0...
>>> > Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> > /usr/bin/podman: stderr Error: invalid reference format
>>> > ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> - Adam King
>>> >>
>>> >> On Thu, Jan 27, 2022 at 7:06 PM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>> >>
>>> >>> Hi!
>>> >>>
>>> >>> I think this happened after I tried to recreate the osd with the
>>> >>> command "ceph orch daemon add osd s-8-2-1:/dev/bcache0"
>>> >>>
>>> >>>
>>> >>> > It looks like cephadm believes "s-8-2-1:/dev/bcache0" is a container
>>> >>> > image for some daemon. Can you provide the output of
>>> >>> > "ceph orch ls --format yaml",
>>> >>>
>>> >>> https://pastebin.com/CStBf4J0
>>> >>>
>>> >>> > "ceph orch upgrade status",
>>> >>> root@s-26-9-19-mon-m1:~# ceph orch upgrade status
>>> >>> {
>>> >>>     "target_image": null,
>>> >>>     "in_progress": false,
>>> >>>     "services_complete": [],
>>> >>>     "progress": null,
>>> >>>     "message": ""
>>> >>> }
>>> >>>
>>> >>>
>>> >>> > "ceph config get mgr container_image",
>>> >>> root@s-26-9-19-mon-m1:~# ceph config get mgr container_image
>>> >>> quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a79324739404cc1765728
>>> >>>
>>> >>>
>>> >>> > and the values for monitoring stack container images (format is
>>> >>> > "ceph config get mgr mgr/cephadm/container_image_<daemon-type>" where
>>> >>> > daemon type is one of "prometheus", "node_exporter", "alertmanager",
>>> >>> > "grafana", "haproxy", "keepalived").
>>> >>> quay.io/prometheus/prometheus:v2.18.1
>>> >>> quay.io/prometheus/node-exporter:v0.18.1
>>> >>> quay.io/prometheus/alertmanager:v0.20.0
>>> >>> quay.io/ceph/ceph-grafana:6.7.4
>>> >>> docker.io/library/haproxy:2.3
>>> >>> docker.io/arcts/keepalived
>>> >>>
>>> >>> >
>>> >>> > Thanks,
>>> >>> >
>>> >>> > - Adam King
>>> >>>
>>> >>> Thanks a lot!
>>> >>>
>>> >>> WBR,
>>> >>>     Fyodor.
>>> >>>
>>> >>> >
>>> >>> > On Thu, Jan 27, 2022 at 9:10 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>>> >>> >
>>> >>> >> Hi!
>>> >>> >>
>>> >>> >> I rebooted the nodes with mgr and now I see the following in the
>>> >>> >> cephadm.log:
>>> >>> >>
>>> >>> >> As I understand it, cephadm is trying to execute some unsuccessful
>>> >>> >> command of mine (I wonder which one); it does not succeed, but it
>>> >>> >> keeps trying and trying. How do I stop it from trying?
>>> >>> >>
>>> >>> >> 2022-01-27 16:02:58,123 7fca7beca740 DEBUG --------------------------------------------------------------------------------
>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
>>> >>> >> 2022-01-27 16:02:58,147 7fca7beca740 DEBUG /usr/bin/podman: 3.3.1
>>> >>> >> 2022-01-27 16:02:58,249 7fca7beca740 INFO Pulling container image s-8-2-1:/dev/bcache0...
>>> >>> >> 2022-01-27 16:02:58,278 7fca7beca740 DEBUG /usr/bin/podman: Error: invalid reference format
>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 INFO /usr/bin/podman: stderr Error: invalid reference format
>>> >>> >> 2022-01-27 16:02:58,279 7fca7beca740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> >>> >> 2022-01-27 16:03:58,420 7f897a7a6740 DEBUG --------------------------------------------------------------------------------
>>> >>> >> cephadm ['--image', 's-8-2-1:/dev/bcache0', 'pull']
>>> >>> >> 2022-01-27 16:03:58,443 7f897a7a6740 DEBUG /usr/bin/podman: 3.3.1
>>> >>> >> 2022-01-27 16:03:58,547 7f897a7a6740 INFO Pulling container image s-8-2-1:/dev/bcache0...
>>> >>> >> 2022-01-27 16:03:58,575 7f897a7a6740 DEBUG /usr/bin/podman: Error: invalid reference format
>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO Non-zero exit code 125 from /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 INFO /usr/bin/podman: stderr Error: invalid reference format
>>> >>> >> 2022-01-27 16:03:58,577 7f897a7a6740 ERROR ERROR: Failed command: /usr/bin/podman pull s-8-2-1:/dev/bcache0
>>> >>> >>
>>> >>> >> WBR,
>>> >>> >>     Fyodor.
>>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


