Hi,

I'm still evaluating Ceph 15.2.5 in a lab, so this problem isn't really hurting me, but I want to understand it and hopefully fix it; it seems like good practice anyway. To test the resilience of the cluster I try to break it in all kinds of ways. Today I powered off (clean shutdown) one OSD node and powered it back on. Last time I tried this there was no problem getting it back online; after a few minutes the cluster health was back to OK. This time it stayed degraded. I checked and noticed that the osd.0 service on that node was failing. Searching around, the common advice was to simply delete the OSD and re-create it. I tried that, but I still can't get the OSD back into service.

First I removed the OSD:

[root@gedasvl02 ~]# ceph osd out 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
osd.0 is already out.

[root@gedasvl02 ~]# ceph auth del 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Error EINVAL: bad entity name

[root@gedasvl02 ~]# ceph auth del osd.0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
updated

[root@gedasvl02 ~]# ceph osd rm 0
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
removed osd.0

[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2            up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3            up   1.00000  1.00000

Looks OK, it's gone.
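(As an aside, I later read that "ceph osd purge" is supposed to roll the removal steps into one command: it removes the OSD from the CRUSH map, deletes its auth key and removes it from the OSD map. If I understand it correctly, something like the following would have been equivalent, though I did not use it here:

# untested in my cluster: purge osd.0 from CRUSH, auth and the OSD map in one step
ceph osd purge 0 --yes-i-really-mean-it
)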
Then I zapped the device:

[root@gedasvl02 ~]# ceph orch device zap gedaopl02 /dev/sdb --force
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
INFO:cephadm:/usr/bin/podman:stderr WARNING: The same type, major and minor should not be used for multiple devices.
INFO:cephadm:/usr/bin/podman:stderr --> Zapping: /dev/sdb
INFO:cephadm:/usr/bin/podman:stderr --> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a/osd-block-3a79800d-2a19-45d8-a850-82c6a8113323 bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr 10485760 bytes (10 MB, 10 MiB) copied, 0.0314447 s, 333 MB/s
INFO:cephadm:/usr/bin/podman:stderr stderr:
INFO:cephadm:/usr/bin/podman:stderr --> Only 1 LV left in VG, will proceed to destroy volume group ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/sbin/vgremove -v -f ceph-3bf1bb28-0858-4464-a848-d7f56319b40a
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing ceph--3bf1bb28--0858--4464--a848--d7f56319b40a-osd--block--3a79800d--2a19--45d8--a850--82c6a8113323 (253:0)
INFO:cephadm:/usr/bin/podman:stderr stderr: Archiving volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" metadata (seqno 5).
INFO:cephadm:/usr/bin/podman:stderr stderr: Releasing logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323"
INFO:cephadm:/usr/bin/podman:stderr stderr: Creating volume group backup "/etc/lvm/backup/ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" (seqno 6).
INFO:cephadm:/usr/bin/podman:stderr stdout: Logical volume "osd-block-3a79800d-2a19-45d8-a850-82c6a8113323" successfully removed
INFO:cephadm:/usr/bin/podman:stderr stderr: Removing physical volume "/dev/sdb" from volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a"
INFO:cephadm:/usr/bin/podman:stderr stdout: Volume group "ceph-3bf1bb28-0858-4464-a848-d7f56319b40a" successfully removed
INFO:cephadm:/usr/bin/podman:stderr Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
INFO:cephadm:/usr/bin/podman:stderr stderr: 10+0 records in
INFO:cephadm:/usr/bin/podman:stderr 10+0 records out
INFO:cephadm:/usr/bin/podman:stderr stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.0355641 s, 295 MB/s
INFO:cephadm:/usr/bin/podman:stderr --> Zapping successful for: <Raw Device: /dev/sdb>

And re-added it:

[root@gedasvl02 ~]# ceph orch daemon add osd gedaopl02:/dev/sdb
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
Created osd(s) 0 on host 'gedaopl02'

But the OSD is still out.
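Before digging further I also want to check whether the new osd.0 container actually came up on that node. This is roughly what I plan to run next (the daemon name "osd.0" as seen by cephadm is my assumption):

# on the admin node: list the daemons cephadm thinks are running on gedaopl02
ceph orch ps gedaopl02

# on gedaopl02 itself: show the journal of the new OSD container
cephadm logs --name osd.0

In any case, the OSD tree currently looks like this: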
[root@gedasvl02 ~]# ceph osd tree
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         0.43658  root default
-7         0.21829      host gedaopl01
 2    ssd  0.21829          osd.2            up   1.00000  1.00000
-3               0      host gedaopl02
-5         0.21829      host gedaopl03
 3    ssd  0.21829          osd.3            up   1.00000  1.00000
 0               0  osd.0                  down         0  1.00000

Looking at the cluster log in the web UI, I see the following error:

Failed to apply osd.dashboard-admin-1606745745154 spec DriveGroupSpec(name=dashboard-admin-1606745745154->placement=PlacementSpec(host_pattern='*'), service_id='dashboard-admin-1606745745154', service_type='osd', data_devices=DeviceSelection(size='223.6GB', rotational=False, all=False), osd_id_claims={}, unmanaged=False, filter_logic='AND', preview_only=False): No filters applied
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2108, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2005, in _apply_service
    self.osd_service.create_from_spec(cast(DriveGroupSpec, spec))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 43, in create_from_spec
    ret = create_from_spec_one(self.prepare_drivegroup(drive_group))
  File "/usr/share/ceph/mgr/cephadm/services/osd.py", line 127, in prepare_drivegroup
    drive_selection = DriveSelection(drive_group, inventory_for_host)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 32, in __init__
    self._data = self.assign_devices(self.spec.data_devices)
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in assign_devices
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/selector.py", line 138, in <genexpr>
    if not all(m.compare(disk) for m in FilterGenerator(device_filter)):
  File "/lib/python3.6/site-packages/ceph/deployment/drive_selection/matchers.py", line 410, in compare
    raise Exception("No filters applied")
Exception: No filters applied

There is also a "pgs undersized" warning; maybe that is contributing to the problem?

[root@gedasvl02 ~]# ceph -s
INFO:cephadm:Inferring fsid d0920c36-2368-11eb-a5de-005056b703af
INFO:cephadm:Inferring config /var/lib/ceph/d0920c36-2368-11eb-a5de-005056b703af/mon.gedasvl02/config
INFO:cephadm:Using recent ceph image docker.io/ceph/ceph:v15
  cluster:
    id:     d0920c36-2368-11eb-a5de-005056b703af
    health: HEALTH_WARN
            Degraded data redundancy: 13142/39426 objects degraded (33.333%), 176 pgs degraded, 225 pgs undersized

  services:
    mon: 1 daemons, quorum gedasvl02 (age 2w)
    mgr: gedasvl02.vqswxg(active, since 2w), standbys: gedaopl02.yrwzqh
    mds: cephfs:1 {0=cephfs.gedaopl01.zjuhem=up:active} 1 up:standby
    osd: 3 osds: 2 up (since 4d), 2 in (since 94m)

  task status:
    scrub status:
        mds.cephfs.gedaopl01.zjuhem: idle

  data:
    pools:   7 pools, 225 pgs
    objects: 13.14k objects, 77 GiB
    usage:   148 GiB used, 299 GiB / 447 GiB avail
    pgs:     13142/39426 objects degraded (33.333%)
             176 active+undersized+degraded
             49  active+undersized

  io:
    client: 0 B/s rd, 6.1 KiB/s wr, 0 op/s rd, 0 op/s wr

Best Regards,
Oliver
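PS: Regarding the "No filters applied" traceback, my guess is that it comes from the OSD service spec the dashboard created (osd.dashboard-admin-1606745745154). If it helps, I can post that spec; I assume something like the following would dump it, and that removing the service spec would only stop cephadm from creating new OSDs from it, without touching existing OSDs:

# export the OSD service specs the orchestrator is trying to apply
ceph orch ls osd --export

# if that dashboard-created spec turns out to be broken, remove it
ceph orch rm osd.dashboard-admin-1606745745154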