Hi,

I'm still evaluating Ceph 15.2.5 in a lab, so this problem isn't really hurting me, but I want to understand it and hopefully fix it; it seems like good practice anyway. To test the resilience of the cluster I try to break it in all kinds of ways. Today I powered off (clean shutdown) one OSD node and powered it back on. Last time I tried this there was no problem getting it back online; after a few minutes the cluster health was back to OK. This time it stayed degraded. I checked and noticed that the osd.0 service on that node was failing. Searching around, the common advice was to simply delete the OSD and re-create it. I tried that, but I still can't get the OSD back into service.

First I removed the OSD:

[root@gedasvl02 ~]# ceph osd out 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
osd.0 is already out.

[root@gedasvl02 ~]# ceph auth del 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Error EINVAL: bad entity name

[root@gedasvl02 ~]# ceph auth del osd.0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
updated

[root@gedasvl02 ~]# ceph osd rm 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
removed osd.0

[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2            up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3            up   1.00000  1.00000

Looks OK, it's gone.
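(As an aside, I later read that "ceph osd purge" is supposed to roll the removal steps into one command: it removes the OSD from the CRUSH map, deletes its auth key and removes it from the OSD map. If I understand it correctly, something like the following would have been equivalent, though I did not use it here:

# untested in my cluster: purge osd.0 from CRUSH, auth and the OSD map in one step
ceph osd purge 0 --yes-i-really-mean-it
)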
Then I zapped the device:

[root@gedasvl02 ~]# ceph orch device zap gedaopl02 /dev/sdb --force
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
INFO:cephadm:/usr/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/sdb
INFO:cephadm:/usr/bin/podman:stderr --> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323 bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0314447 s, 333 MB/s
INFO:cephadm:/usr/bin/podman:stderr stderr:
INFO:cephadm:/usr/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing ceph--3bf1bb28--0858--4464--a848--d7f56319b40a-osd--block--3a79800d--2a19--45d8--a850--82c6a8113323 (253:0)
INFO:cephadm:/usr/bin/podman:stderr stderr: Archiving volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" metadata (seqno 5).
INFO:cephadm:/usr/bin/podman:stderr stderr: Releasing logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323"
INFO:cephadm:/usr/bin/podman:stderr stderr: Creating volume group backup "/etc/lvm/backup/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" (seqno 6).
INFO:cephadm:/usr/bin/podman:stderr stdout: Logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323" successfully removed
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing physical volume "/dev/sdb" from volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a"
INFO:cephadm:/usr/bin/podman:stderr stdout: Volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" successfully removed
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0355641 s, 295 MB/s
INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdb>

And re-added it:

[root@gedasvl02 ~]# ceph orch daemon add osd gedaopl02:/dev/sdb
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Created osd(s) 0 on host 'gedaopl02'

But the OSD is still out.
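Before digging further I also want to check whether the new osd.0 container actually came up on that node. This is roughly what I plan to run next (the daemon name "osd.0" as seen by cephadm is my assumption):

# on the admin node: list the daemons cephadm thinks are running on gedaopl02
ceph orch ps gedaopl02

# on gedaopl02 itself: show the journal of the new OSD container
cephadm logs --name osd.0

In any case, the OSD tree currently looks like this: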
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2            up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3            up   1.00000  1.00000
 0               0  osd.0                  down         0  1.00000

Looking at the cluster log in the web UI, I see the following error:

Failed to apply osd.dashboard-admin-1606745745154 spec DriveGroupSpec(name=dashboard-admin-1606745745154->placement=PlacementSpec(host_pattern='*'), service_id='dashboard-admin-1606745745154', service_type='osd', data_devices=DeviceSelection(size='223.6GB', rotational=False, all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): No filters applied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2108, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2005, in _apply_service
    self.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 43, in create_from_spec
    ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 127, in prepare_drivegroup
    drive_selection = DriveSelection(drive_group, inventory_for_host)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 32, in __init__
    self._data = self.assign_devices(self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in assign_devices
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in <genexpr>
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/matchers.py", line 410, in compare
    raise Exception("No filters applied")
Exception: No filters applied

There is also a "pgs undersized" warning; maybe that is contributing to the problem?

[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     d0920c36-2368-11eb-a5de-005056b703af
    health: HEALTH_WARN
            Degraded data redundancy: 13142/39426 objects degraded (33.333%), 176 pgs degraded, 225 pgs undersized

  services:
    mon: 1 daemons, quorum gedasvl02 (age 2w)
    mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
    mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
    osd: 3 osds: 2 up (since 4d), 2 in (since 94m)

  task status:
    scrub status:
        mds.cephfs.gedaopl01.zjuhem: idle

  data:
    pools:   7 pools, 225 pgs
    objects: 13.14k objects, 77 GiB
    usage:   148 GiB used, 299 GiB / 447 GiB avail
    pgs:     13142/39426 objects degraded (33.333%)
             176 active+undersized+degraded
             49  active+undersized

  io:
    client: 0 B/s rd, 6.1 KiB/s wr, 0 op/s rd, 0 op/s wr

Best Regards,
Oliver
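PS: Regarding the "No filters applied" traceback, my guess is that it comes from the OSD service spec the dashboard created (osd.dashboard-admin-1606745745154). If it helps, I can post that spec; I assume something like the following would dump it, and that removing the service spec would only stop cephadm from creating new OSDs from it, without touching existing OSDs:

# export the OSD service specs the orchestrator is trying to apply
ceph orch ls osd --export

# if that dashboard-created spec turns out to be broken, remove it
ceph orch rm osd.dashboard-admin-1606745745154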