Cannot add disks back after their OSDs were drained and removed from a cluster

I find that I cannot re-add a disk to a Ceph cluster after the OSD on that disk has been removed. Ceph still seems to know that these disks exist, but has lost their "host:dev" information:

```
# ceph device ls
DEVICE                                     HOST:DEV         DAEMONS  WEAR  LIFE EXPECTANCY
SAMSUNG_MZ7L37T6_S6KHNE0T100049                                                    0%   <-- should be host01:sda, was osd.0
SAMSUNG_MZ7L37T6_S6KHNE0T100050            host01:sdb       osd.1      0%
SAMSUNG_MZ7L37T6_S6KHNE0T100052                                                    0%   <-- should be host02:sda
SAMSUNG_MZ7L37T6_S6KHNE0T100053            host01:sde       osd.9      0%
SAMSUNG_MZ7L37T6_S6KHNE0T100061            host01:sdf       osd.11     0%
SAMSUNG_MZ7L37T6_S6KHNE0T100062                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100063            host01:sdc       osd.5      0%
SAMSUNG_MZ7L37T6_S6KHNE0T100064            host01:sdg       osd.13     0%
SAMSUNG_MZ7L37T6_S6KHNE0T100065                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100066            host01:sdd       osd.7      0%
SAMSUNG_MZ7L37T6_S6KHNE0T100067                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100068                                                    0%    <-- should be host02:sdb
SAMSUNG_MZ7L37T6_S6KHNE0T100069                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100070                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100071                                                    0%
SAMSUNG_MZ7L37T6_S6KHNE0T100072            host01:sdh       osd.15     0%
SAMSUNG_MZQL27T6HBLA-00B7C_S6CVNG0T321600  host03:nvme4n1   osd.20     0%
... omitted ...
SAMSUNG_MZQL27T6HBLA-00B7C_S6CVNG0T321608  host03:nvme8n1   osd.22     0%
```

For disk "SAMSUNG_MZ7L37T6_S6KHNE0T100049", the "HOST:DEV" field is empty, while I believe it should be "host01:sda", as I have confirmed by running `smartctl -i /dev/sda" on host01.

I suspect this missing information is the reason that OSDs cannot be created on these devices, either manually or automatically. I have tried:
1. `ceph orch daemon add osd host01:/dev/sda`, which prints "Created no osd(s) on host host01; already created?"
2. `ceph orch apply osd --all-available-devices`, which adds nothing.
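
In case it helps, I assume the orchestrator's own inventory (and its reasons for rejecting a device) can be inspected with something like the following; I am listing the commands only, not their output:

```
# Orchestrator inventory for the host; the wide output includes reject reasons
ceph orch device ls host01 --wide

# Force a re-scan instead of relying on the cached inventory
ceph orch device ls host01 --refresh
```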

I arrived at this situation while testing whether draining a host works: I drained host02, removed it, and added it back with:
```
ceph orch host drain host02
ceph orch host rm host02
ceph orch host add host02 <internal_ip> --labels _admin
```
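
I did not keep the output from the drain, but the checks I would use to confirm it had finished before removing the host are roughly these (sketch only):

```
# Status of the OSD removals queued by the drain
ceph orch osd rm status

# The drained OSDs should be gone from the CRUSH tree
ceph osd tree

# No daemons should be left on the host before "ceph orch host rm"
ceph orch ps host02
```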

I am running Ceph 17.2.6 on Ubuntu; Ceph was deployed via cephadm. FYI, here is the relevant orchestrator spec for the OSD services:
```
# ceph orch ls osd --export
service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: all-available-devices
service_name: osd.all-available-devices
placement:
  host_pattern: '*'
spec:
  data_devices:
    all: true
  filter_logic: AND
  objectstore: bluestore
```
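
If editing the spec turns out to be the fix, I assume the round trip would look roughly like this (sketch only; `osd-specs.yaml` is just a placeholder filename):

```
# Export the current OSD specs, edit them, then re-apply
ceph orch ls osd --export > osd-specs.yaml
# ... edit osd-specs.yaml as needed ...
ceph orch apply -i osd-specs.yaml
```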


Any thoughts on what may be wrong here? Is there a way I can tell Ceph "you are wrong about the whereabouts of these disks, forget what you know and fetch the disk information afresh"?

Any help much appreciated!
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


