Hi,
I haven't done this in production yet either, but in a test cluster I
threw away that config-key and it just gets regenerated. So I suppose
one could try that without any big risk.
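For reference, throwing away the key is just a plain config-key removal, e.g. (assuming the same key name as in your commands below):
ceph config-key rm mgr/cephadm/host.ceph-osd31.devices.0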
Just a note, this should also work (get instead of dump):
ceph config-key get mgr/cephadm/host.ceph-osd31.devices.0 | jq '.devices[].created'
Quoting Bob Gibson <rjg@xxxxxxxxxx>:
I enabled debug logging with `ceph config set mgr
mgr/cephadm/log_to_cluster_level debug` and viewed the logs with
`ceph -W cephadm --watch-debug`. I can see the orchestrator
refreshing the device list, and this is reflected in the
`ceph-volume.log` file on the target osd nodes. When I restart the
mgr, `ceph orch device ls` reports each device with “5w ago” under
the “REFRESHED” column. After the orchestrator attempts to refresh
the device list, `ceph orch device ls` stops outputting any data at
all until I restart the mgr again.
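(For completeness: restarting the mgr can be done either by failing over to the standby with `ceph mgr fail` or by restarting the daemon via `ceph orch daemon restart mgr.<name>`, with the daemon name depending on the cluster. On cephadm hosts the ceph-volume activity typically ends up in /var/log/ceph/<fsid>/ceph-volume.log.)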
I discovered that I can query the cached device data using `ceph
config-key dump`. On the problematic cluster, the `created`
attribute is stale, e.g.
ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd31.devices.0"' | jq '.devices[].created'
"2024-09-23T17:56:44.914535Z"
"2024-09-23T17:56:44.914569Z"
"2024-09-23T17:56:44.914591Z"
"2024-09-23T17:56:44.914612Z"
"2024-09-23T17:56:44.914632Z"
"2024-09-23T17:56:44.914652Z"
"2024-09-23T17:56:44.914672Z"
"2024-09-23T17:56:44.914692Z"
"2024-09-23T17:56:44.914711Z"
"2024-09-23T17:56:44.914732Z"
whereas on working clusters the `created` attribute is set to the
time the device information was last cached, e.g.
ceph config-key dump | jq -r '."mgr/cephadm/host.ceph-osd1.devices.0"' | jq '.devices[].created'
"2024-10-28T21:49:29.510593Z"
"2024-10-28T21:49:29.510635Z"
"2024-10-28T21:49:29.510657Z"
"2024-10-28T21:49:29.510678Z"
It appears that the orchestrator is polling the devices but failing
to update the cache for some reason. It would be interesting to see
what would happen if I removed one of these device entries from the
cache, but the cluster is in production so I’m hesitant to poke at it.
We have a maintenance window scheduled in December, which will
provide an opportunity to perform a complete restart of the cluster.
Hopefully that will clean things up. In the meantime, I’ve set all
devices to be unmanaged, and the cluster is otherwise healthy, so
unless anyone has any other ideas to offer, I guess I’ll just leave
things as-is until the maintenance window.
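(For anyone wanting to do the same, with the stock all-available-devices OSD service that's roughly `ceph orch apply osd --all-available-devices --unmanaged=true`, or `unmanaged: true` in the OSD service spec; a custom drive group spec would need its own service name.)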
Cheers,
/rjg
On Oct 25, 2024, at 10:31 AM, Bob Gibson <rjg@xxxxxxxxxx> wrote:
[…]
My hunch is that some persistent state is corrupted, or there’s
something else preventing the orchestrator from successfully
refreshing its device status, but I don’t know how to troubleshoot
this. Any ideas?
I don't think this is related to the 'osd' service. As suggested by
Tobi, enabling cephadm debug will tell you more.
Agreed. I’ll dig through the logs some more today to see if I can
spot any problems.
Cheers,
/rjg
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx