I enabled debug logging with `ceph config set mgr mgr/cephadm/log_to_cluster_level debug` and viewed the logs with `ceph -W cephadm --watch-debug`. I can see the orchestrator refreshing the device list, and this is reflected in the `ceph-volume.log` file on the target OSD nodes. When I restart the mgr, `ceph orch device ls` reports each device with "5w ago" under the REFRESHED column. After the orchestrator attempts to refresh the device list, `ceph orch device ls` stops outputting any data at all until I restart the mgr again.

I discovered that I can query the cached device data using `ceph config-key dump`. On the problematic cluster, the `created` attribute is stale, e.g.

    $ ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd31.devices.0"' | jq '.devices[].created'
    "2024-09-23T17:56:44.914535Z"
    "2024-09-23T17:56:44.914569Z"
    "2024-09-23T17:56:44.914591Z"
    "2024-09-23T17:56:44.914612Z"
    "2024-09-23T17:56:44.914632Z"
    "2024-09-23T17:56:44.914652Z"
    "2024-09-23T17:56:44.914672Z"
    "2024-09-23T17:56:44.914692Z"
    "2024-09-23T17:56:44.914711Z"
    "2024-09-23T17:56:44.914732Z"

whereas on working clusters the `created` attribute is set to the time the device information was last cached, e.g.

    $ ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd1.devices.0"' | jq '.devices[].created'
    "2024-10-28T21:49:29.510593Z"
    "2024-10-28T21:49:29.510635Z"
    "2024-10-28T21:49:29.510657Z"
    "2024-10-28T21:49:29.510678Z"

It appears that the orchestrator is polling the devices but failing to update the cache for some reason. It would be interesting to see what happens if I removed one of these device entries from the cache, but the cluster is in production, so I'm hesitant to poke at it. We have a maintenance window scheduled in December which will provide an opportunity to perform a complete restart of the cluster. Hopefully that will clean things up.
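In case it's useful to anyone hitting the same symptom, here is a rough sketch of how the staleness check above could be automated instead of eyeballing the jq output per host. It's just an illustration, not anything official: it assumes the `mgr/cephadm/host.<host>.devices.<n>` key layout shown above, assumes host names contain no dots, and the 24-hour threshold is an arbitrary choice.

```python
import json
from datetime import datetime, timedelta, timezone

def stale_hosts(dump, max_age=timedelta(hours=24), now=None):
    """Given the parsed JSON from `ceph config-key dump`, return hosts whose
    newest cached device 'created' timestamp is older than max_age.

    Hypothetical helper for illustration only; assumes the key layout
    mgr/cephadm/host.<host>.devices.<n> and dot-free host names."""
    now = now or datetime.now(timezone.utc)
    stale = []
    for key, value in dump.items():
        if not key.startswith("mgr/cephadm/host.") or ".devices." not in key:
            continue
        host = key.split(".")[1]  # "mgr/cephadm/host" has no dots, so [1] is the host
        devices = json.loads(value).get("devices", [])
        stamps = [datetime.fromisoformat(d["created"].replace("Z", "+00:00"))
                  for d in devices if "created" in d]
        if stamps and now - max(stamps) > max_age:
            stale.append(host)
    return stale
```

You'd feed it `json.loads(subprocess.check_output(["ceph", "config-key", "dump"]))` and compare the result against `ceph orch host ls`.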
In the meantime, I've set all devices to be unmanaged, and the cluster is otherwise healthy, so unless anyone has any other ideas to offer I guess I'll just leave things as-is until the maintenance window.

Cheers,
/rjg

On Oct 25, 2024, at 10:31 AM, Bob Gibson <rjg@xxxxxxxxxx> wrote:

> […]
>
> My hunch is that some persistent state is corrupted, or there's something else preventing the orchestrator from successfully refreshing its device status, but I don't know how to troubleshoot this. Any ideas?
>
> I don't think this is related to the 'osd' service. As suggested by Tobi, enabling cephadm debug will tell you more.
>
> Agreed. I'll dig through the logs some more today to see if I can spot any problems.
>
> Cheers,
> /rjg

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx