From the ceph versions output I can see

    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },

so it seems like all the OSD daemons on this cluster are using that 16.2.10-160 image, and I'm guessing most of them are running, so it must have existed at some point. I'm curious whether `ceph config dump | grep container_image` shows a different image setting for the OSDs.

Anyway, in terms of moving forward, it might be best to get all the daemons onto an image you know works. I also see both 16.2.10-208 and 16.2.10-248 listed as versions, which implies two different images are in use even among the other daemons. Unless there's a reason for all these different images, I'd pick the most up-to-date one that you know can be pulled on all hosts and do a `ceph orch upgrade start --image <image-name>`. That would get all the daemons onto a single image, and it might fix the broken OSDs that are failing to pull the 16.2.10-160 image.
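Roughly, the sequence I have in mind is sketched below. The 16.2.10-248 tag is just an example, picked because it is the newest version in your output; substitute whatever tag you have actually confirmed can be pulled on every host, and treat the upgrade check step as an optional dry run if your cephadm supports it:

# See which container_image settings are in effect, and whether the OSDs
# are pinned to something different from the rest of the cluster.
$ ceph config dump | grep container_image
$ ceph config get osd container_image

# Optional sanity check: inspect the target image and see which daemons
# would be upgraded before committing to it.
$ ceph orch upgrade check --image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-248

# Move every daemon onto that single image, then watch progress.
$ ceph orch upgrade start --image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-248
$ ceph orch upgrade status

Once the upgrade converges, `ceph versions` should report a single version across mon/mgr/osd/rgw, and cephadm should stop trying to pull the missing 16.2.10-160 image for the OSDs.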
On Wed, Mar 27, 2024 at 8:56 PM Alex <mr.alexey@xxxxxxxxx> wrote:
> Hello.
>
> We're rebuilding our OSD nodes.
> One cluster worked without any issues; this one is being stubborn.
>
> I attempted to add one back to the cluster and am seeing the error below
> in our logs:
>
> cephadm ['--image', 'registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160', 'pull']
> 2024-03-27 19:30:53,901 7f49792ed740 DEBUG /bin/podman: 4.6.1
> 2024-03-27 19:30:53,905 7f49792ed740 INFO Pulling container image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,045 7f49792ed740 DEBUG /bin/podman: Trying to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,266 7f49792ed740 DEBUG /bin/podman: Error: initializing source docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 INFO Non-zero exit code 125 from /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Trying to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Error: initializing source docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 ERROR ERROR: Failed command: /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
>
> $ ceph versions
> {
>     "mon": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
>     },
>     "mgr": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
>     },
>     "osd": {
>         "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
>     },
>     "mds": {},
>     "rgw": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 3
>     },
>     "overall": {
>         "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160,
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 5,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 4
>     }
> }
>
> I don't understand why it's trying to pull 16.2.10-160, which doesn't exist.
>
> registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8   5      93b3137e7a65   11 months ago   696 MB
> registry.redhat.io/rhceph/rhceph-5-rhel8             5-416  838cea16e15c   11 months ago   1.02 GB
> registry.redhat.io/openshift4/ose-prometheus         v4.6   ec2d358ca73c   17 months ago   397 MB
>
> This happens using cephadm-ansible as well as:
>
> $ ceph orch ls --export --service_name xxx > xxx.yml
> $ sudo ceph orch apply -i xxx.yml
>
> I tried ceph orch daemon add osd host:/dev/sda, which surprisingly created a
> volume on host:/dev/sda and an OSD I can see in
>
> $ ceph osd tree
>
> but it did not get added to the host, I suspect because of the same podman
> error, and now I'm unable to remove it.
>
> $ ceph orch osd rm
>
> does not work, even with the --force flag.
>
> I stopped the removal with
>
> $ ceph orch osd rm stop
>
> after 10+ minutes.
>
> I'm considering running ceph osd purge osd# --force, but I'm worried it may
> only make things worse. ceph -s shows that OSD, but it is not up or in.
>
> Thanks, and looking forward to any advice!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx