From the ceph versions output I can see

    "osd": {
        "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
    },

so it seems like all the OSD daemons on this cluster are using that 16.2.10-160 image, and I'm guessing most of them are running, so it must have existed at some point. I'm curious whether `ceph config dump | grep container_image` shows a different image setting for the OSDs.

Anyway, in terms of moving forward, it might be best to get all the daemons onto an image you know works. I also see both 16.2.10-208 and 16.2.10-248 listed as versions, which implies two different images are in use even among the other daemons. Unless there's a reason for all these different images, I'd pick the most up-to-date one that you know can be pulled on all hosts and do a `ceph orch upgrade start --image <image-name>`. That would get all the daemons onto a single image, and it might fix the broken OSDs that are failing to pull the 16.2.10-160 image.
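Roughly, the sequence I have in mind is sketched below. The 16.2.10-248 tag is just an example, picked because it is the newest version in your output; substitute whatever tag you have actually confirmed can be pulled on every host, and treat the upgrade check step as an optional dry run if your cephadm supports it:

# See which container_image settings are in effect, and whether the OSDs
# are pinned to something different from the rest of the cluster.
$ ceph config dump | grep container_image
$ ceph config get osd container_image

# Optional sanity check: inspect the target image and see which daemons
# would be upgraded before committing to it.
$ ceph orch upgrade check --image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-248

# Move every daemon onto that single image, then watch progress.
$ ceph orch upgrade start --image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-248
$ ceph orch upgrade status

Once the upgrade converges, `ceph versions` should report a single version across mon/mgr/osd/rgw, and cephadm should stop trying to pull the missing 16.2.10-160 image for the OSDs.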
On Wed, Mar 27, 2024 at 8:56 PM Alex <mr.alexey@xxxxxxxxx> wrote:
> Hello.
>
> We're rebuilding our OSD nodes.
> One cluster worked without any issues; this one is being stubborn.
>
> I attempted to add one back to the cluster and am seeing the error below
> in our logs:
>
> cephadm ['--image', 'registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160', 'pull']
> 2024-03-27 19:30:53,901 7f49792ed740 DEBUG /bin/podman: 4.6.1
> 2024-03-27 19:30:53,905 7f49792ed740 INFO Pulling container image registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,045 7f49792ed740 DEBUG /bin/podman: Trying to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,266 7f49792ed740 DEBUG /bin/podman: Error: initializing source docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 INFO Non-zero exit code 125 from /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Trying to pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160...
> 2024-03-27 19:30:54,270 7f49792ed740 INFO /bin/podman: stderr Error: initializing source docker://registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160: reading manifest 16.2.10-160 in registry.redhat.io/rhceph/rhceph-5-rhel8: manifest unknown
> 2024-03-27 19:30:54,270 7f49792ed740 ERROR ERROR: Failed command: /bin/podman pull registry.redhat.io/rhceph/rhceph-5-rhel8:16.2.10-160
>
> $ ceph versions
> {
>     "mon": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
>     },
>     "mgr": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 1,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 2
>     },
>     "osd": {
>         "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160
>     },
>     "mds": {},
>     "rgw": {
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 3
>     },
>     "overall": {
>         "ceph version 16.2.10-160.el8cp (6977980612de1db28e41e0a90ff779627cde7a8c) pacific (stable)": 160,
>         "ceph version 16.2.10-208.el8cp (791f73fbb4bbca2ffe53a2ea0f8706dbffadcc0b) pacific (stable)": 5,
>         "ceph version 16.2.10-248.el8cp (0edb63afd9bd3edb333364f2e0031b77e62f4896) pacific (stable)": 4
>     }
> }
>
> I don't understand why it's trying to pull 16.2.10-160, which doesn't exist.
>
> registry.redhat.io/rhceph/rhceph-5-dashboard-rhel8   5      93b3137e7a65   11 months ago   696 MB
> registry.redhat.io/rhceph/rhceph-5-rhel8             5-416  838cea16e15c   11 months ago   1.02 GB
> registry.redhat.io/openshift4/ose-prometheus         v4.6   ec2d358ca73c   17 months ago   397 MB
>
> This happens using cephadm-ansible as well as:
>
> $ ceph orch ls --export --service_name xxx > xxx.yml
> $ sudo ceph orch apply -i xxx.yml
>
> I tried ceph orch daemon add osd host:/dev/sda, which surprisingly created a
> volume on host:/dev/sda and an OSD I can see in
>
> $ ceph osd tree
>
> but it did not get added to the host, I suspect because of the same podman
> error, and now I'm unable to remove it.
>
> $ ceph orch osd rm
>
> does not work, even with the --force flag.
>
> I stopped the removal with
>
> $ ceph orch osd rm stop
>
> after 10+ minutes.
>
> I'm considering running ceph osd purge osd# --force, but I'm worried it may
> only make things worse. ceph -s shows that OSD, but it is not up or in.
>
> Thanks, and looking forward to any advice!
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx