unable to deploy ceph -- failed to read label for XXX No such file or directory

hello,

during basic experimentation I'm running into a weird situation when adding OSDs to a test cluster. The test cluster is created as 3x XEN DomU Debian Bookworm (test1-3), each with 4x CPU, 8 GB RAM, xvda root, xvdb swap, and 4x 20 GB data disks xvdj,k,l,m (LVM volumes in Dom0, propagated via the xen phy device), cleaned with `wipefs -a`
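
For reference, the data disks were wiped on each node roughly like this before handing them to the cluster (a sketch; the device names match the DomU layout above):

```
# on each of test1-3: clear any previous signatures from the four data disks
for dev in /dev/xvd{j,k,l,m}; do
    wipefs -a "$dev"
done
```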

```
apt-get install cephadm ceph-common
cephadm bootstrap --mon-ip 10.0.0.101
ceph orch host add test2
ceph orch host add test3
```
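
At this point the orchestrator view of hosts and available devices can be double-checked with something like this (a sketch, run on test1; output omitted):

```
# confirm all three hosts are registered with the orchestrator
ceph orch host ls
# list the devices cephadm considers available for OSD creation
ceph orch device ls
```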

When adding OSDs, the first host gets its OSDs created as expected, but while creating the OSDs on the second host the output gets weird: even when adding each device separately, the output shows that `ceph orch` apparently tries to create multiple OSDs at once

```
root@test1:~# for xxx in j k l m; do ceph orch daemon add osd test2:/dev/xvd$xxx; done
Created osd(s) 0,1,2,3 on host 'test2'
Created osd(s) 0,1 on host 'test2'
Created osd(s) 2,3 on host 'test2'
Created osd(s) 1 on host 'test2'
```
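
Which OSD daemons and LVM volumes actually ended up where can be inspected with something like this (a sketch; the first two commands run on test1, the last one on the OSD node):

```
# cluster-wide OSD placement as the monitors see it
ceph osd tree
# OSD daemons as the orchestrator sees them, per host
ceph orch ps --daemon-type osd
# on test2: LVM volumes that ceph-volume has prepared locally
cephadm shell -- ceph-volume lvm list
```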

The syslog on the test2 node shows errors:

```
2023-04-16T20:57:02.528456+00:00 test2 bash[10426]: cephadm 2023-04-16T20:57:01.389951+0000 mgr.test1.ucudzp (mgr.14206) 1691 : cephadm [INF] Found duplicate OSDs: osd.0 in status running on test1, osd.0 in status error on test2
2023-04-16T20:57:02.528748+00:00 test2 bash[10426]: cephadm 2023-04-16T20:57:01.391346+0000 mgr.test1.ucudzp (mgr.14206) 1692 : cephadm [INF] Removing daemon osd.0 from test2 -- ports []
2023-04-16T20:57:02.528943+00:00 test2 bash[10426]: cluster 2023-04-16T20:57:02.350564+0000 mon.test1 (mon.0) 743 : cluster [WRN] Health check failed: 2 failed cephadm daemon(s) (CEPHADM_FAILED_DAEMON)
2023-04-16T20:57:17.972962+00:00 test2 bash[20098]: stderr: failed to read label for /dev/ceph-48f3646c-7070-4a37-b9a4-ed0a4a983965/osd-block-11a0dc2b-f8e1-4694-813f-2309ab6a5c1d: (2) No such file or directory
2023-04-16T20:57:17.973064+00:00 test2 bash[20098]: stderr: 2023-04-16T20:57:17.962+0000 7fad2451c540 -1 bluestore(/dev/ceph-48f3646c-7070-4a37-b9a4-ed0a4a983965/osd-block-11a0dc2b-f8e1-4694-813f-2309ab6a5c1d) _read_bdev_label failed to open /dev/ceph-48f3646c-7070-4a37-b9a4-ed0a4a983965/osd-block-11a0dc2b-f8e1-4694-813f-2309ab6a5c1d: (2) No such file or directory
2023-04-16T20:57:17.973181+00:00 test2 bash[20098]: --> Failed to activate via lvm: command returned non-zero exit status: 1
2023-04-16T20:57:17.973278+00:00 test2 bash[20098]: --> Failed to activate via simple: 'Namespace' object has no attribute 'json_config'
2023-04-16T20:57:17.973368+00:00 test2 bash[20098]: --> Failed to activate any OSD(s)
```
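
The path in the error is the LVM logical volume that ceph-volume should have prepared for the OSD; whether it actually exists on test2 can be checked with something like this (a sketch using standard LVM tooling):

```
# on test2: list ceph-created logical volumes and their backing devices
lvs -o lv_name,vg_name,devices
# the device node bluestore complains about should show up here if the LV is active
ls -l /dev/ceph-*/
```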

The ceph and cephadm packages are installed from Debian Bookworm:

```
ii ceph-common 16.2.11+ds-2 amd64 common utilities to mount and interact with a ceph storage cluster
ii cephadm     16.2.11+ds-2 amd64 utility to bootstrap ceph daemons with systemd and containers
```

The management session script can be found at https://pastebin.com/raw/FiX7DMHS


None of the symptoms I could google helped me understand why this situation is happening, nor how to troubleshoot or debug it. I could understand if the nodes were too low on RAM to run this experiment, but the behavior does not really look like an OOM issue.
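
To rule out the OOM killer, the nodes can be checked with something like this (a sketch):

```
# on each node: current memory headroom
free -h
# any traces of the kernel OOM killer since boot
dmesg -T | grep -iE "out of memory|oom-killer|killed process"
```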

any idea would be appreciated

thanks
bodik


