Hi,
I haven't done this in production yet either, but in a test cluster I
threw away that config-key and it just gets regenerated. So I suppose
one could try that without any big risk.
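For reference, throwing away the key is just a plain config-key removal, e.g. (assuming the same key name as in your commands below):
ceph config-key rm mgr/cephadm/host.ceph-osd31.devices.0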
Just a note, this should also work (get instead of dump):
ceph config-key get mgr/cephadm/host.ceph-osd31.devices.0 | jq '.devices[].created'
Quoting Bob Gibson <rjg@xxxxxxxxxx>:
I enabled debug logging with `ceph config set mgr
mgr/cephadm/log_to_cluster_level debug` and viewed the logs with
`ceph -W cephadm --watch-debug`. I can see the orchestrator
refreshing the device list, and this is reflected in the
`ceph-volume.log` file on the target osd nodes. When I restart the
mgr, `ceph orch device ls` reports each device with “5w ago” under
the “REFRESHED” column. After the orchestrator attempts to refresh
the device list, `ceph orch device ls` stops outputting any data at
all until I restart the mgr again.
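(For completeness: restarting the mgr can be done either by failing over to the standby with `ceph mgr fail` or by restarting the daemon via `ceph orch daemon restart mgr.<name>`, with the daemon name depending on the cluster. On cephadm hosts the ceph-volume activity typically ends up in /var/log/ceph/<fsid>/ceph-volume.log.)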
I discovered that I can query the cached device data using `ceph
config-key dump`. On the problematic cluster, the `created`
attribute is stale, e.g.
ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd31.devices.0"' | jq '.devices[].created'
"2024-09-23T17:56:44.914535Z"
"2024-09-23T17:56:44.914569Z"
"2024-09-23T17:56:44.914591Z"
"2024-09-23T17:56:44.914612Z"
"2024-09-23T17:56:44.914632Z"
"2024-09-23T17:56:44.914652Z"
"2024-09-23T17:56:44.914672Z"
"2024-09-23T17:56:44.914692Z"
"2024-09-23T17:56:44.914711Z"
"2024-09-23T17:56:44.914732Z"
whereas on working clusters the `created` attribute is set to the
time the device information was last cached, e.g.
ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd1.devices.0"' | jq '.devices[].created'
"2024-10-28T21:49:29.510593Z"
"2024-10-28T21:49:29.510635Z"
"2024-10-28T21:49:29.510657Z"
"2024-10-28T21:49:29.510678Z"
It appears that the orchestrator is polling the devices but failing
to update the cache for some reason. It would be interesting to see
what would happen if I removed one of these device entries from the
cache, but the cluster is in production so I’m hesitant to poke at it.
We have a maintenance window scheduled in December, which will
provide an opportunity to perform a complete restart of the cluster.
Hopefully that will clean things up. In the meantime, I’ve set all
devices to be unmanaged, and the cluster is otherwise healthy, so
unless anyone has any other ideas to offer, I guess I’ll just leave
things as-is until the maintenance window.
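(For anyone wanting to do the same, with the stock all-available-devices OSD service that's roughly `ceph orch apply osd --all-available-devices --unmanaged=true`, or `unmanaged: true` in the OSD service spec; a custom drive group spec would need its own service name.)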
Cheers,
/rjg
On Oct 25, 2024, at 10:31 AM, Bob Gibson <rjg@xxxxxxxxxx> wrote:
[…]
My hunch is that some persistent state is corrupted, or there’s
something else preventing the orchestrator from successfully
refreshing its device status, but I don’t know how to troubleshoot
this. Any ideas?
I don't think this is related to the 'osd' service. As suggested by
Tobi, enabling cephadm debug will tell you more.
Agreed. I’ll dig through the logs some more today to see if I can
spot any problems.
Cheers,
/rjg
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx