Random OSD deployment failures when using 'ceph orch daemon add osd' command

Hi All :)

I'm trying to debug some interesting behavior I've encountered where I fail to deploy OSDs via the 'ceph orch daemon add osd' command.

I deploy my Ceph cluster without an OSD specification file, just a spec file with the cluster hosts listed in it. The bootstrap completes successfully and all hosts appear with the correct labels when I execute 'ceph orch host ls':
===================================================================
HOST          ADDR          LABELS                        STATUS
controller-0  172.31.0.170  _admin,mon,mgr,mds,rgw,crash
controller-1  172.31.3.232  mon,mgr,mds,rgw,crash,_admin
controller-2  172.31.1.45   mon,mgr,mds,rgw,crash,_admin
ovscompute-0  172.31.0.28   osd,crash,_admin
ovscompute-1  172.31.0.26   osd,crash,_admin
5 hosts in cluster
===================================================================
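
For reference, the host entries in the spec file look roughly like this (reconstructed from the output above and trimmed to one host per role, so treat it as a sketch rather than the exact file):
===================================================================
service_type: host
hostname: controller-0
addr: 172.31.0.170
labels:
  - _admin
  - mon
  - mgr
  - mds
  - rgw
  - crash
---
service_type: host
hostname: ovscompute-0
addr: 172.31.0.28
labels:
  - osd
  - crash
  - _admin
===================================================================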

In my installation setup I have two nodes labeled as OSD hosts.
Immediately after the bootstrap and 'apply spec' finish successfully and the cluster is up and running, an Ansible playbook runs that executes 'ceph orch daemon add osd' for each of them.
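
The exact invocations (also visible in the mgr audit log further down) are:
===================================================================
ceph orch daemon add osd ovscompute-0:data_devices=/dev/vdb,/dev/vdc
ceph orch daemon add osd ovscompute-1:data_devices=/dev/vdb,/dev/vdc
===================================================================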

I have verified that the OSDs were not deployed on either of the two hosts.

In the ceph-mgr log I see:
===================================================================
2024-08-13T10:06:47.713+0000 7f482b1c5640  0 log_channel(audit) log [DBG] : from='client.14194 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "ovscompute-1:data_devices=/dev/vdb,/dev/vdc", "target": ["mon-mgr", ""]}]: dispatch
2024-08-13T10:06:47.714+0000 7f482b1c5640  0 log_channel(audit) log [DBG] : from='client.14191 -' entity='client.admin' cmd=[{"prefix": "orch daemon add osd", "svc_arg": "ovscompute-0:data_devices=/dev/vdb,/dev/vdc", "target": ["mon-mgr", ""]}]: dispatch

2024-08-21T07:11:19.167+0000 7f675e4eb640  0 log_channel(cephadm) log [DBG] : Processing DriveGroup DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_name: osd
placement:
  host_pattern: ovscompute-1
spec:
  data_devices:
    paths:
    - /dev/vdb
    - /dev/vdc
  filter_logic: AND
  objectstore: bluestore
'''))
2024-08-21T07:11:19.170+0000 7f675e4eb640  0 log_channel(cephadm) log [DBG] : mon_command: 'osd tree' -> 0 in 0.002s
2024-08-21T07:11:19.171+0000 7f6763535640  0 log_channel(cephadm) log [DBG] : Checking matching hosts -> []
2024-08-21T07:11:19.173+0000 7f675e4eb640  0 log_channel(cephadm) log [DBG] : Processing DriveGroup DriveGroupSpec.from_json(yaml.safe_load('''service_type: osd
service_name: osd
placement:
  host_pattern: ovscompute-0
spec:
  data_devices:
    paths:
    - /dev/vdb
    - /dev/vdc
  filter_logic: AND
  objectstore: bluestore
'''))
2024-08-21T07:11:19.180+0000 7f675e4eb640  0 log_channel(cephadm) log [DBG] : mon_command: 'osd tree' -> 0 in 0.007s
2024-08-21T07:11:19.182+0000 7f6763535640  0 log_channel(cephadm) log [DBG] : Checking matching hosts -> []
===================================================================

The interesting part I found here, which might help explain why the command fails, is that the 'Checking matching hosts -> []' print (from src/pybind/mgr/cephadm/services/osd.py)
returns an empty list. When this occurs, the OSD host never receives the 'ceph-volume lvm batch' command.
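
To be explicit, nothing along the lines of the following ever shows up on the host (the exact flags cephadm passes vary by release, so this is only an approximation):
===================================================================
ceph-volume lvm batch --yes /dev/vdb /dev/vdc
===================================================================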

I have another, similar setup on which this issue doesn't occur, and there the 'Checking matching hosts' print actually contains the correct host for the OSD.

I am now debugging this part of the code (from src/pybind/mgr/cephadm/services/osd.py) to try to figure out why matching_hosts is an empty list:
===================================================================
matching_hosts = drive_group.placement.filter_matching_hostspecs(
     self.mgr.cache.get_schedulable_hosts())
===================================================================
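
As a sanity check I wrote a small standalone sketch of what I assume this matching does (a glob match of the placement's host_pattern against the hostnames the mgr cache reports as schedulable; filter_matching_hostnames is my own helper here, not the cephadm one):
===================================================================
import fnmatch

# Simplified, standalone model of the matching (my assumption, not the
# actual cephadm implementation): glob-match host_pattern against the
# hostnames returned by the mgr's host cache.
def filter_matching_hostnames(host_pattern, schedulable_hostnames):
    return [h for h in schedulable_hostnames if fnmatch.fnmatch(h, host_pattern)]

# When the cache knows about the OSD host, the pattern matches it:
print(filter_matching_hostnames("ovscompute-1", ["ovscompute-0", "ovscompute-1"]))
# -> ['ovscompute-1']

# When the cache has no (matching) hosts, the result is the empty list
# I see in the mgr log ("Checking matching hosts -> []"):
print(filter_matching_hostnames("ovscompute-1", []))
# -> []
===================================================================
So either the hosts the cache considers schedulable don't include the OSD nodes at that point, or the pattern doesn't match them; that's what I'm trying to pin down.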

Lastly, I noticed that when I run:
===================================================================
ceph -W cephadm --watch-debug
===================================================================
before the OSD deployment part, the issue is not reproduced. This information may be useless, but the behavior is consistent and I have no idea why it would make a difference.


Has anyone faced a similar issue?

Thanks