Re: [16.2.6] When adding new host, cephadm deploys ceph image that no longer exists

David Orman <ormandj@xxxxxxxxxxxx> · Wed, 29 Sep 2021 12:06:06 -0500

It appears that when an updated container for 16.2.6 was pushed (the
first release shipped a remoto version with a bug), the old one was
removed from quay. We had to update our 16.2.6 clusters to the 'new'
16.2.6 image, and just did the typical upgrade with the image
specified. This should resolve your issue and also fix the effects of
the remoto bug:

https://tracker.ceph.com/issues/50526
https://github.com/alfredodeza/remoto/pull/63

Once you've upgraded, I would expect cephadm to use the correct hash
when adding hosts.
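
For reference, here is a minimal sketch of the "typical upgrade with the
image specified", meant to be run from within "cephadm shell". The
quay.io/ceph/ceph:v16.2.6 tag is an assumption on my part (use whatever
reference points at the republished image), and the function names and
the CEPH/IMAGE variables are only illustrative wrappers:

```shell
#!/bin/sh
# Sketch only. IMAGE is an assumed tag, not a value taken from this thread.
# CEPH can be overridden (e.g. CEPH=echo) to preview the commands
# without touching a cluster.
CEPH="${CEPH:-ceph}"
IMAGE="${IMAGE:-quay.io/ceph/ceph:v16.2.6}"

# Ask the orchestrator to move the cluster's daemons to the given image.
start_upgrade() {
    "$CEPH" orch upgrade start --image "$IMAGE"
}

# Report the orchestrator's current upgrade state.
upgrade_status() {
    "$CEPH" orch upgrade status
}

# Print the stored global container_image value
# (the same query used in the original report).
current_image() {
    "$CEPH" config-key get config/global/container_image
}
```

Call start_upgrade, check upgrade_status until the upgrade has
finished, then call current_image. I would expect the stored reference
to no longer be the sha256:31ad0a2b... digest that quay rejects, but
verify that on your own cluster before adding hosts.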

On Wed, Sep 29, 2021 at 11:02 AM Andrew Gunnerson
<accounts.ceph@xxxxxxxxxxxx> wrote:
>
> Hello all,
>
> I'm trying to troubleshoot a test cluster that is attempting to deploy an old
> quay.io/ceph/ceph@sha256:<hash> image that no longer exists when adding a new
> host.
>
> The cluster is running 16.2.6 and was deployed last week with:
>
>     cephadm bootstrap --mon-ip $(facter -p ipaddress) --allow-fqdn-hostname --ssh-user cephadm
>     # Within "cephadm shell"
>     ceph orch host add <hostname> <IP> _admin
>     <repeated for 14 more hosts>
>
> This initial cluster worked fine and the mon/mgr/osd/crash/etc containers were
> all running the following image:
>
>     quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c
>
> This week, we tried deploying 3 additional hosts using the same "ceph orch host
> add" commands and cephadm seems to be attempting to deploy the same image, but
> it no longer exists on quay.io.
>
> The error shows up in the active mgr's logs as:
>
>     Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c -e NODE_NAME=<hostname> -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c -c %u %g /var/lib/ceph
>     stat: stderr Trying to pull quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c...
>     stat: stderr Error: Error initializing source docker://quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c: Error reading manifest sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c in quay.io/ceph/ceph: manifest unknown: manifest unknown
>
> I suspect this is because of the container_image global config option:
>
>     [ceph: root@<hostname> /]# ceph config-key get config/global/container_image
>     quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c
>
> My questions are:
>
> * Is it expected for the cluster to reference a (potentially nonexistent) image
>   by sha256 hash versus (e.g.) the :v16 or :v16.2.6 tags?
>
> * What's the best way to get back into a state where new hosts can be added
>   again? Is it sufficient to just update the container_image global config?
>
> Thank you!
> Andrew Gunnerson
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx