Thank you very much! The previous attempts at adding new hosts with the missing image seem to have left cephadm in a bad state. We restarted the mgrs and then upgraded to the same version with:

    ceph orch upgrade start --ceph-version 16.2.6

That appears to have deployed new images with the latest digest, and we were able to add hosts successfully afterwards.

On Wed, Sep 29, 2021, at 13:06, David Orman wrote:
> It appears that when an updated container image for 16.2.6 was pushed
> (the first release shipped a remoto version with a bug), the old image
> was removed from quay. We had to update our 16.2.6 clusters to the
> 'new' 16.2.6 version and just did the typical upgrade with the image
> specified. This should resolve your issue, as well as fix the effects
> of the remoto bug:
>
> https://tracker.ceph.com/issues/50526
> https://github.com/alfredodeza/remoto/pull/63
>
> Once you're upgraded, I would expect it to use the correct hash for
> the host adds.
>
> On Wed, Sep 29, 2021 at 11:02 AM Andrew Gunnerson
> <accounts.ceph@xxxxxxxxxxxx> wrote:
>>
>> Hello all,
>>
>> I'm trying to troubleshoot a test cluster that, when adding a new host,
>> attempts to deploy an old quay.io/ceph/ceph@sha256:<hash> image that no
>> longer exists.
>>
>> The cluster is running 16.2.6 and was deployed last week with:
>>
>> cephadm bootstrap --mon-ip $(facter -p ipaddress) --allow-fqdn-hostname --ssh-user cephadm
>> # Within "cephadm shell"
>> ceph orch host add <hostname> <IP> _admin
>> <repeated for 14 more hosts>
>>
>> This initial cluster worked fine, and the mon/mgr/osd/crash/etc. containers
>> were all running the following image:
>>
>> quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c
>>
>> This week, we tried deploying 3 additional hosts using the same "ceph orch
>> host add" commands. cephadm attempts to deploy the same image, but it no
>> longer exists on quay.io.
>>
>> The error shows up in the active mgr's logs as:
>>
>> Non-zero exit code 125 from /bin/podman run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint stat --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c -e NODE_NAME=<hostname> -e CEPH_USE_RANDOM_NONCE=1 quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c -c %u %g /var/lib/ceph
>> stat: stderr Trying to pull quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c...
>> stat: stderr Error: Error initializing source docker://quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c: Error reading manifest sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c in quay.io/ceph/ceph: manifest unknown: manifest unknown
>>
>> I suspect this is because of the container_image global config option:
>>
>> [ceph: root@<hostname> /]# ceph config-key get config/global/container_image
>> quay.io/ceph/ceph@sha256:31ad0a2bd8182c948cace326251ce1561804d7de948f370c8c44d29a175cc67c
>>
>> My questions are:
>>
>> * Is it expected for the cluster to reference a (potentially nonexistent)
>>   image by sha256 hash rather than (e.g.) the :v16 or :v16.2.6 tags?
>>
>> * What's the best way to get back into a state where new hosts can be
>>   added again? Is it sufficient to just update the container_image global
>>   config?
>>
>> Thank you!
>> Andrew Gunnerson
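
For anyone who hits the same state later, here is a minimal sketch of the
recovery sequence, assuming a standby mgr is available. Only the upgrade
command is exactly what we ran; the mgr failover, status, and verification
steps are illustrative additions:

    # Restart the active mgr by failing over to a standby
    ceph mgr fail
    # Re-run the upgrade to the same version; cephadm resolves the version
    # to the digest currently published on quay.io
    ceph orch upgrade start --ceph-version 16.2.6
    # Watch progress until the upgrade completes
    ceph orch upgrade status
    # Verify the stored image now points at a digest that exists
    ceph config-key get config/global/container_image

    # Alternative suggested above (untested here): upgrade with an
    # explicit image tag instead of a digest
    # ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.6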
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx