On 15-09-2023 09:21, Boris Behrens wrote:
Hi Stefan,
the cluster is running 17.2.6 across the board. The mentioned containers
with the other versions don't show up in ceph -s or ceph versions.
It looks like it is host related.
One host gets the correct 17.2.6 images, one gets the 16.2.11 images, and
the third one uses the 7.0.0-7183-g54142666 images (whatever that is).
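For reference, a quick way to see which image each host would actually run
is to list the daemons locally on every host (a rough sketch; this assumes
cephadm and jq are installed there, and the JSON field names may vary
slightly between releases):
cephadm ls | jq -r '.[] | "\(.name)  \(.version)  \(.container_image_name)"'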
root@0cc47a6df330:~# ceph config-key get config/global/container_image
Error ENOENT:
root@0cc47a6df330:~# ceph config-key list |grep container_image
"config-history/12/+mgr.0cc47a6df14e/container_image",
"config-history/13/+mgr.0cc47aad8ce8/container_image",
"config/mgr.0cc47a6df14e/container_image",
"config/mgr.0cc47aad8ce8/container_image",
I've tried to set the default image with
ceph config-key set config/global/container_image quay.io/ceph/ceph:v17.2.6@sha256:6b0a24e3146d4723700ce6579d40e6016b2c63d9bf90422653f2d4caa49be232
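For what it's worth, the same default is normally driven through ceph config
rather than a raw config-key write (a sketch, not verified against this
cluster; the relevant option should be container_image):
ceph config set global container_image quay.io/ceph/ceph:v17.2.6
ceph config dump | grep container_image   # confirm the value took effect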
But I cannot redeploy the mgr daemons, because there is no standby daemon.
root@0cc47a6df330:~# ceph orch redeploy mgr
Error EINVAL: Unable to schedule redeploy for mgr.0cc47aad8ce8: No
standby MGR
But there should be:
root@0cc47a6df330:~# ceph orch ps
NAME                     HOST          PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
mgr.0cc47a6df14e.iltiot  0cc47a6df14e  *:9283  running (23s)  22s ago    2m   10.6M    -        16.2.11  de4b0b384ad4  0f31a162fa3e
mgr.0cc47aad8ce8         0cc47aad8ce8          running (16h)  8m ago     16h  591M     -        17.2.6   22cd8daf4d70  8145c63fdc44
I guess that one of the managers is not working correctly (probably the
16.2.11 one). IIRC I once changed the image reference for a container
(in its systemd unit files) after I had managed to redeploy all
containers with a non-working image (test setup). So first make sure
which manager is actually running, then try to fix the other one by
editing the relevant config for that container, pointing it at the same
image as the running container; pull the necessary image first if need
be. See the sketch below.
After you've got a standby manager up and running, you can redeploy the
necessary daemons. Be careful ... there are commands that redeploy all
daemons at the same time, and you normally don't want to do that ;-).
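Roughly something like this, from memory (the fsid is a placeholder and the
exact paths can differ a bit between releases):
# on the host with the broken mgr (16.2.11 in your ps output)
podman pull quay.io/ceph/ceph:v17.2.6
# cephadm keeps the image reference in the daemon's unit.run file
vi /var/lib/ceph/<fsid>/mgr.0cc47a6df14e.iltiot/unit.run
systemctl restart ceph-<fsid>@mgr.0cc47a6df14e.iltiot.service
# with a standby mgr available, daemons can then be redeployed one by one
# against an explicit image
ceph orch daemon redeploy mgr.0cc47a6df14e.iltiot quay.io/ceph/ceph:v17.2.6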
root@0cc47a6df330:~# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mgr          2/2      8m ago     19h  0cc47a6df14e;0cc47a6df330;0cc47aad8ce8
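(Side note: the placement lists three hosts while only 2/2 mgrs are expected
and running. If you want a mgr on each of those hosts once a working image is
in place, re-applying the spec with an explicit count should do it; a sketch,
using the hosts from the output above:)
ceph orch apply mgr --placement="3 0cc47a6df14e 0cc47a6df330 0cc47aad8ce8"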
I've also removed podman and containerd, killed all the directories and then
did a fresh reinstall of podman, which also did not work.
It's also strange that the daemons with the wonky versions get an extra
suffix.
If I knew how, I would happily nuke the whole orchestrator, podman
and everything that goes along with it, and start over. In the end it is
not that hard to start some mgr/mon daemons without podman, so I would
be back to a classical cluster.
I tried this yesterday, but the daemons still use those very strange
images and I just don't understand why.
I could just nuke the whole dev cluster, wipe all disks and start fresh
after reinstalling the hosts, but as I have to adopt 17 clusters to the
orchestrator, I'd rather get some learnings from the one that isn't working :)
There is actually a cephadm "kill it with fire" option that does that for
you (see below), but yeah, make sure you know how to fix it when things do
not go according to plan. It all magically works, until it doesn't ;-).
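If you do go down that road: IIRC the command is cephadm rm-cluster, run per
host, and it really does remove every cephadm-managed daemon and its data on
that host, so triple-check the fsid first (sketch, fsid is a placeholder):
cephadm rm-cluster --fsid <fsid> --force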
Good luck, and keep us updated with any further challenges / progress.
Gr. Stefan