Re: ceph orch upgrade tries to pull latest?

Hi Stephan, hi Adam,

thank you for reading my questions and for your suggestions.
I did what you suggested, but unfortunately the cluster in question has meanwhile degraded to the point of unmanageability: for reasons incomprehensible to me, cephadm began trying to redeploy and reconfigure daemons (monitors, managers, OSDs), which failed ... due to, well, the failing image registry access. One monitor died completely and could not be redeployed or removed and re-added; I could only revive it manually, the old-school, pre-cephadm, non-container way.

This 95-OSD cluster is used for development and for evaluating the performance of specific use cases, and it does not contain valuable data - I consider it more costly to try to repair it than to start a new one from scratch.

However, this situation once more fuels my impression that there is far too much abstraction and automation in the cephadm/container approach of modern Ceph. Once things go wrong, it has become harder to understand what is happening ... and, in this case, to keep it from getting even worse.

Again, many thanks for your support. I'm happy that this community exists.

Cheers, toBias


From: "Stephan Hohn" <stephan@xxxxxxxxxxxx>
To: "Tobias Tempel" <tobias.tempel@xxxxxxx>
Cc: "Adam King" <adking@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx>
Sent: Thursday, 9 January, 2025 12:31:43
Subject: Re: ceph orch upgrade tries to pull latest?

Hi Tobias,

have you tried setting your private registry credentials before starting the upgrade command?

~# ceph cephadm registry-login <registry url> <username> <password>

e.g. ~# ceph cephadm registry-login harborregistry <username optional> <password optional>

~# ceph orch upgrade start --image harborregistry/quay.io/ceph/ceph:v18.2.4

This might also help to debug

~# ceph -W cephadm --watch-debug
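
You can also double-check which image the cluster is actually configured to use - just a quick sketch of the checks I would run, assuming the orchestrator still answers commands:

~# ceph config get mgr container_image
~# ceph config dump | grep -i image
~# ceph orch upgrade status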

Cheers

Stephan

On Thu, 9 Jan 2025 at 12:04, tobias tempel <tobias.tempel@xxxxxxx> wrote:

> Dear Adam,
> thank you very much for your reply.
> In /var/log/ceph/cephadm.log I saw lots of entries like this:
>
>   2025-01-08 10:00:22,045 7ff021d8c000 DEBUG
>   --------------------------------------------------------------------------------
>   cephadm ['--image', 'harborregistry/quay.io/ceph/ceph', '--timeout', '895', 'pull']
>   2025-01-08 10:00:22,172 7ff021d8c000 INFO Pulling container image harborregistry/quay.io/ceph/ceph...
>   2025-01-08 10:00:27,176 7ff021d8c000 INFO Non-zero exit code 125 from /usr/bin/podman pull harborregistry/quay.io/ceph/ceph
>   2025-01-08 10:00:27,176 7ff021d8c000 INFO /usr/bin/podman: stderr Trying to pull harborregistry/quay.io/ceph/ceph:latest...
>   2025-01-08 10:00:27,176 7ff021d8c000 INFO /usr/bin/podman: stderr time="2025-01-08T10:00:22+01:00" level=warning msg="failed, retrying in 1s ... (1/3). Error: initializing source docker://harborregistry/quay.io/ceph/ceph:latest: reading manifest latest in harborregistry/quay.io/ceph/ceph: unknown: resource not found: repo quay.io/ceph/ceph, tag latest not found"
>   ...
>   2025-01-08 10:00:27,176 7ff021d8c000 INFO /usr/bin/podman: stderr Error: initializing source docker://harborregistry/quay.io/ceph/ceph:latest: reading manifest latest in harborregistry/quay.io/ceph/ceph: unknown: resource not found: repo quay.io/ceph/ceph, tag latest not found
>   2025-01-08 10:00:27,177 7ff021d8c000 ERROR ERROR: Failed command: /usr/bin/podman pull harborregistry/quay.io/ceph/ceph
>   2025-01-08 10:01:27,459 7f5f185d0000 DEBUG
>   --------------------------------------------------------------------------------
>
> In the meantime I was given a hint to
>   ceph config set mgr container_image harborregistry/quay.io/ceph/ceph:v18.2.4
> which indeed changed things to
>
>   2025-01-08 17:12:45,952 7ffb1da9b000 DEBUG
>   --------------------------------------------------------------------------------
>   cephadm ['--image', 'harborregistry/quay.io/ceph/ceph:v18.2.4', '--timeout', '895', 'inspect-image']
>   2025-01-08 17:12:46,219 7ffb1da9b000 DEBUG /usr/bin/podman: stdout 2bc0b0f4375ddf4270a9a865dfd4e53063acc8e6c3afd7a2546507cafd2ec86a,[quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906 quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77 harborregistry/quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906 harborregistry/quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77]
>   2025-01-08 17:12:46,649 7ffb1da9b000 DEBUG ceph: stdout ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
>   2025-01-08 17:12:50,852 7f21649cf000 DEBUG
>   --------------------------------------------------------------------------------
>
> Only to then encounter log entries like these:
>
>   2025-01-09 00:01:20,077 7fe3a719e000 DEBUG
>   --------------------------------------------------------------------------------
>   cephadm ['--image', 'docker.io/ceph/daemon-base:latest-master-devel', '--timeout', '895', '_orch', 'deploy', '--fsid', 'xxxxx']
>   2025-01-09 00:01:20,210 7fe3a719e000 DEBUG Loaded deploy configuration: {'fsid': 'xxxxx', 'name': 'mon.monitor0x', 'image': '', 'deploy_arguments': [], 'params': {}, 'meta': {'service_name': 'mon', 'ports': [], 'ip': None, 'deployed_by': ['quay.io/ceph/ceph@sha256:6ac7f923aa1d23b43248ce0ddec7e1388855ee3d00813b52c3172b0b23b37906', 'quay.io/ceph/ceph@sha256:ac06cdca6f2512a763f1ace8553330e454152b82f95a2b6bf33c3f3ec2eeac77'], 'rank': None, 'rank_generation': None, 'extra_container_args': None, 'extra_entrypoint_args': None}, 'config_blobs': {'config': '# minimal ceph.conf for xxxxx\n[global]\n\tfsid = xxxxx\n\tmon_host = [v2:x.x.x.x:3300/0,v1:x.x.x.x:6789/0] [v2:x.x.x.x:3300/0,v1:x.x.x.x:6789/0] [v2:x.x.x.x:3300/0,v1:x.x.x.x:6789/0]\n[mon.monitor0x]\npublic network = x.x.x.0/22\n', 'keyring': '[mon.]\n\tkey = xxxxx\n\tcaps mon = "allow *"\n', 'files': {'config': '[mon.monitor0x]\npublic network = x.x.x.0/22\n'}}}
>   2025-01-09 00:01:20,210 7fe3a719e000 DEBUG Determined image: 'docker.io/ceph/daemon-base:latest-master-devel'
>   2025-01-09 00:01:20,218 7fe3a719e000 INFO Redeploy daemon mon.monitor0x
>   ...
>   2025-01-09 00:02:20,255 7fe3a719e000 INFO Non-zero exit code 125 from /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=docker.io/ceph/daemon-base:latest-master-devel -e NODE_NAME=monitor0x -e CEPH_USE_RANDOM_NONCE=1 docker.io/ceph/daemon-base:latest-master-devel -c %u %g /var/lib/ceph
>   2025-01-09 00:02:20,255 7fe3a719e000 INFO stat: stderr Trying to pull docker.io/ceph/daemon-base:latest-master-devel...
>   2025-01-09 00:02:20,255 7fe3a719e000 INFO stat: stderr Error: initializing source docker://ceph/daemon-base:latest-master-devel: pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp 54.236.113.205:443: i/o timeout
>   2025-01-09 00:02:20,256 7fe3a719e000 ERROR ERROR: Failed to extract uid/gid for path /var/lib/ceph: Failed command: /usr/bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=docker.io/ceph/daemon-base:latest-master-devel -e NODE_NAME=monitor0x -e CEPH_USE_RANDOM_NONCE=1 docker.io/ceph/daemon-base:latest-master-devel -c %u %g /var/lib/ceph: Trying to pull docker.io/ceph/daemon-base:latest-master-devel... Error: initializing source docker://ceph/daemon-base:latest-master-devel: pinging container registry registry-1.docker.io: Get "https://registry-1.docker.io/v2/": dial tcp 54.236.113.205:443: i/o timeout
>
>
> After that I was also directed to a config setting "mgr mgr/cephadm/default_registry", which up to now is absent from our configuration (nothing shows up in "ceph config dump | grep registry"), but I have no idea yet what to set here ...
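>
> Perhaps something like this is what is expected - purely a guess at the form on my side, with our registry host as the value, not something I have tried yet:
>
>   ceph config set mgr mgr/cephadm/default_registry harborregistry
>   ceph config get mgr mgr/cephadm/default_registry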
>
> Again thank you very much,
> cheers, toBias
>
>
> ------------------------------
> *From: *"Adam King" <adking@xxxxxxxxxx>
> *To: *"Tobias Tempel" <tobias.tempel@xxxxxxx>
> *Cc: *"ceph-users" <ceph-users@xxxxxxx>
> *Sent: *Wednesday, 8 January, 2025 20:15:51
> *Subject: * Re: ceph orch upgrade tries to pull latest?
>
> It looks like the "resource not found" message is being directly output by
> podman. Is there anything in the cephadm.log (/var/log/ceph/cephadm.log) on
> one of the hosts where this is happening that says what podman command
> cephadm was running that hit this error?
>
> On Wed, Jan 8, 2025 at 5:27 AM tobias tempel <tobias.tempel@xxxxxxx>
> wrote:
>
> > Dear all,
> > I'm trying to do a cephadm upgrade in an airgapped environment from 18.2.2 to 18.2.4 ... yet to no avail.
> > The local image registry is a Harbor instance. I start the upgrade process with
> >
> >   ceph orch upgrade start --image harborregistry/quay.io/ceph/ceph:v18.2.4
> >
> > and the status looks good:
> >
> >   ceph orch upgrade status
> >   {
> >     "target_image": "harborregistry/quay.io/ceph/ceph:v18.2.4",
> >     "in_progress": true,
> >     "which": "Upgrading all daemon types on all hosts",
> >     "services_complete": [],
> >     "progress": "",
> >     "message": "",
> >     "is_paused": false
> >   }
> >
> > In the cephadm log I can see messages like
> >
> >   cephadm ['--image', 'harborregistry/quay.io/ceph/ceph:v18.2.4', '--timeout', '895', 'inspect-image']
> >
> > which is fine (works on the command line), but also
> >
> >   2025-01-08 10:33:53,911 7f9c66d50000 INFO /usr/bin/podman: stderr Error: initializing source docker://harborregistry/quay.io/ceph/ceph:latest: reading manifest latest in harborregistry/quay.io/ceph/ceph: unknown: resource not found: repo quay.io/ceph/ceph, tag latest not found
> >
> > So for some reason cephadm keeps trying to pull the tag "latest" - which I did not specify - and this fails ... again and again and again.
> > What am I missing?
> > Can anyone give me a hint where to look?
> >
> > Thank you very much,
> > cheers, toBias
> >
> > PS: ceph config get mgr
> > WHO  MASK  LEVEL     OPTION                                      VALUE                                             RO
> > mgr        basic     container_image                             harborregistry/quay.io/ceph/ceph                  *
> > mgr        advanced  mgr/cephadm/container_image_alertmanager    harborregistry/quay.io/prometheus/alertmanager    *
> > mgr        advanced  mgr/cephadm/container_image_base            harborregistry/quay.io/ceph/ceph
> > mgr        advanced  mgr/cephadm/container_image_grafana         harborregistry/quay.io/ceph/ceph-grafana          *
> > mgr        advanced  mgr/cephadm/container_image_node_exporter   harborregistry/quay.io/prometheus/node-exporter   *
> > mgr        advanced  mgr/cephadm/container_image_prometheus      harborregistry/quay.io/prometheus/prometheus      *
> >


