On 15-09-2023 09:21, Boris Behrens wrote:
Hi Stefan,
the cluster is running 17.2.6 across the board. The mentioned containers
with the other versions don't show up in ceph -s or ceph versions.
It looks like it is host related.
One host gets the correct 17.2.6 images, one gets the 16.2.11 images, and
the third one uses the 7.0.0-7183-g54142666 images (whatever that is).
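For reference, a quick way to see which image each host would actually run
is to list the daemons locally on every host (a rough sketch; this assumes
cephadm and jq are installed there, and the JSON field names may vary
slightly between releases):
cephadm ls | jq -r '.[] | "\(.name)  \(.version)  \(.container_image_name)"'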
root@0cc47a6df330:~# ceph config-key get config/global/container_image
Error ENOENT:
root@0cc47a6df330:~# ceph config-key list |grep container_image
"config-history/12/+mgr.0cc47a6df14e/container_image",
"config-history/13/+mgr.0cc47aad8ce8/container_image",
"config/mgr.0cc47a6df14e/container_image",
"config/mgr.0cc47aad8ce8/container_image",
I've tried to set the default image with
ceph config-key set config/global/container_image quay.io/ceph/ceph:v17.2.6@sha256:6b0a24e3146d4723700ce6579d40e6016b2c63d9bf90422653f2d4caa49be232
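For what it's worth, the same default is normally driven through ceph config
rather than a raw config-key write (a sketch, not verified against this
cluster; the relevant option should be container_image):
ceph config set global container_image quay.io/ceph/ceph:v17.2.6
ceph config dump | grep container_image   # confirm the value took effect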
But I cannot redeploy the mgr daemons, because there is no standby daemon.
root@0cc47a6df330:~# ceph orch redeploy mgr
Error EINVAL: Unable to schedule redeploy for mgr.0cc47aad8ce8: No
standby MGR
But there should be:
root@0cc47a6df330:~# ceph orch ps
NAME                     HOST          PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
mgr.0cc47a6df14e.iltiot  0cc47a6df14e  *:9283  running (23s)  22s ago    2m   10.6M    -        16.2.11  de4b0b384ad4  0f31a162fa3e
mgr.0cc47aad8ce8         0cc47aad8ce8          running (16h)  8m ago     16h  591M     -        17.2.6   22cd8daf4d70  8145c63fdc44
I guess that one of the managers is not working correctly (probably the
16.2.11 one). IIRC I once changed the image reference for a container
(in its systemd unit files) after I had managed to redeploy all
containers with a non-working image (test setup). So first make sure
which manager is actually running, then try to fix the other one by
editing the relevant config for that container, pointing it at the same
image as the running container; pull the necessary image first if need
be. See the sketch below.
After you've got a standby manager up and running, you can redeploy the
necessary daemons. Be careful ... there are commands that redeploy all
daemons at the same time, and you normally don't want to do that ;-).
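Roughly something like this, from memory (the fsid is a placeholder and the
exact paths can differ a bit between releases):
# on the host with the broken mgr (16.2.11 in your ps output)
podman pull quay.io/ceph/ceph:v17.2.6
# cephadm keeps the image reference in the daemon's unit.run file
vi /var/lib/ceph/<fsid>/mgr.0cc47a6df14e.iltiot/unit.run
systemctl restart ceph-<fsid>@mgr.0cc47a6df14e.iltiot.service
# with a standby mgr available, daemons can then be redeployed one by one
# against an explicit image
ceph orch daemon redeploy mgr.0cc47a6df14e.iltiot quay.io/ceph/ceph:v17.2.6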
root@0cc47a6df330:~# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mgr          2/2      8m ago     19h  0cc47a6df14e;0cc47a6df330;0cc47aad8ce8
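(Side note: the placement lists three hosts while only 2/2 mgrs are expected
and running. If you want a mgr on each of those hosts once a working image is
in place, re-applying the spec with an explicit count should do it; a sketch,
using the hosts from the output above:)
ceph orch apply mgr --placement="3 0cc47a6df14e 0cc47a6df330 0cc47aad8ce8"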
I've also removed podman and containerd, killed all the directories and then
did a fresh reinstall of podman, which also did not work.
It's also strange that the daemons with the wonky versions get an extra
suffix.
If I knew how, I would happily nuke the whole orchestrator, podman
and everything that goes along with it, and start over. In the end it is
not that hard to start some mgr/mon daemons without podman, so I would
be back to a classical cluster.
I tried this yesterday, but the daemons still use those very strange
images and I just don't understand why.
I could just nuke the whole dev cluster, wipe all disks and start fresh
after reinstalling the hosts, but as I have to adopt 17 clusters to the
orchestrator, I'd rather get some learnings from the one that isn't working :)
There is actually a cephadm "kill it with fire" option that does that for
you (see below), but yeah, make sure you know how to fix it when things do
not go according to plan. It all magically works, until it doesn't ;-).
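If you do go down that road: IIRC the command is cephadm rm-cluster, run per
host, and it really does remove every cephadm-managed daemon and its data on
that host, so triple-check the fsid first (sketch, fsid is a placeholder):
cephadm rm-cluster --fsid <fsid> --force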
Good luck, and keep us updated with any further challenges / progress.
Gr. Stefan