Thanks Eugen. Now that you mention it, it was rather silly of me to attempt to use the orchestrator to remove an unmanaged resource :-) Your example for managing devices is very similar to what I'm trying to do, and what has been working for us on other clusters. I'm using a separate osd spec per node, like the following:

service_type: osd
service_id: ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
    rotational: 0
    size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

I tried explicitly adding `unmanaged: false`, but that didn't make a difference. In fact, after doing so, `ceph orch ls osd osd.ceph-osd31 --export` doesn't include it, so it must be implied.

# ceph orch ls osd osd.ceph-osd31 --export
service_type: osd
service_id: ceph-osd31
service_name: osd.ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
    rotational: 0
    size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

My misunderstanding was assuming that devices had to be recreated before they could be managed. Thanks for setting me straight on that. I think the orchestrator's inability to refresh the device list is the crux of our problem. I'll scrutinize the logs again, as Tobi suggested.

Cheers,
/rjg

On Oct 24, 2024, at 3:49 AM, Eugen Block <eblock@xxxxxx> wrote:

EXTERNAL EMAIL | USE CAUTION

Hi,

there are a couple of ways to get your OSDs into "managed" state. You can't remove the "unmanaged" service because it's unmanaged. ;-) Here's an example from a test cluster where I adopted three OSDs; as expected, they are now unmanaged:

soc9-ceph:~ # ceph orch ls osd
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd                3  30s ago    -    <unmanaged>

Now I apply a spec with the same "service_name":

soc9-ceph:~ # cat osd-unmanaged.yaml
service_type: osd
service_name: osd
unmanaged: false
placement:
  label: osd
spec:
  data_devices:
    rotational: 1
    size: 10G
  db_devices:
    rotational: 0
    size: 15G
  filter_logic: AND
  objectstore: bluestore

I explicitly used the "unmanaged: false" statement to show that I want this service_name to be managed:

soc9-ceph:~ # ceph orch apply -i osd-unmanaged.yaml
Scheduled osd update...

soc9-ceph:~ # ceph orch ls osd
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd                3  2m ago     8s   label:osd

Now all my OSDs are managed. In this case this is already my target drivegroup: should I need to replace a drive, this spec would be applied. I would recommend creating a spec that matches your requirements (for example, encrypted) so that your OSDs are managed; then you could recreate them one by one.

Regarding the failing refresh, I don't have any ideas other than inspecting all the logs, and maybe turning on debug mode as Tobi already suggested. I haven't seen that in any cluster yet, so it's hard to tell. Are any of the hosts shown as "offline" in 'ceph orch host ls' output? I wonder if it could be a connection issue, an MTU mismatch, AppArmor, or a firewall...

Quoting Bob Gibson <rjg@xxxxxxxxxx>:

Sorry to resurrect this thread, but while I was able to get the cluster healthy again by manually creating the osd, I'm still unable to manage osds using the orchestrator. The orchestrator is generally working, but it appears to be unable to scan devices. Immediately after failing out the mgr, `ceph orch device ls` displays device status from >4 weeks ago, which was when we converted the cluster to be managed by cephadm. Eventually the orchestrator attempts to refresh its device status, and at that point `ceph orch device ls` stops displaying any output at all.
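To get more detail out of the mgr while this happens, I've been raising the cephadm log level. These are the standard cephadm debug settings; treat the block below as a rough sketch of my procedure rather than an exact transcript of my session:

# raise the cephadm module's logging to debug (very verbose)
ceph config set mgr mgr/cephadm/log_to_cluster_level debug

# follow the cephadm cluster log while the device refresh runs
ceph -W cephadm --watch-debug

# drop back to the default level when finished
ceph config set mgr mgr/cephadm/log_to_cluster_level info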
I can reproduce this state almost immediately if I run `ceph orch device ls --refresh` to force an immediate refresh. The mgr log shows events like the following just before `ceph orch device ls` stops reporting output (one event for every osd node in the cluster):

"Detected new or changed devices on ceph-osd31"

Here are the osd services in play:

# ceph orch ls osd
NAME            PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
osd                         95  8m ago     -    <unmanaged>
osd.ceph-osd31               4  8m ago     6d   ceph-osd31

# ceph orch ls osd --export
service_type: osd
service_name: osd
unmanaged: true
spec:
  filter_logic: AND
  objectstore: bluestore
---
service_type: osd
service_id: ceph-osd31
service_name: osd.ceph-osd31
placement:
  hosts:
  - ceph-osd31
spec:
  data_devices:
    rotational: 0
    size: '3TB:'
  encrypted: true
  filter_logic: AND
  objectstore: bluestore

I tried deleting the default "osd" service in case it was somehow conflicting with my per-node spec, but it looks like that's not allowed, so I assume any custom osd service specs override the unmanaged default.

# ceph orch rm osd
Invalid service 'osd'. Use 'ceph orch ls' to list available services.

My hunch is that some persistent state is corrupted, or that something else is preventing the orchestrator from successfully refreshing its device status, but I don't know how to troubleshoot this. Any ideas?

Cheers,
/rjg

P.S. @Eugen: When I first started this thread you said it was unnecessary to destroy an osd to convert it from unmanaged to managed. Can you explain how this is done? Although we want to recreate the osds to enable encryption, it would save time, and spare the SSDs unnecessary wear, while we troubleshoot.

On Oct 16, 2024, at 2:45 PM, Eugen Block <eblock@xxxxxx> wrote:

EXTERNAL EMAIL | USE CAUTION

Glad to hear it worked out for you!

Quoting Bob Gibson <rjg@xxxxxxxxxx>:

I've been away on vacation and just got back to this. I'm happy to report that manually recreating the OSD with ceph-volume and then adopting it with cephadm fixed the problem. Thanks again for your help, Eugen!

Cheers,
/rjg

On Sep 29, 2024, at 10:40 AM, Eugen Block <eblock@xxxxxx> wrote:

EXTERNAL EMAIL | USE CAUTION

Okay, apparently this is not what I was facing. I see two other options right now. The first would be to purge osd.88 from the crush tree entirely. The second approach would be to create an osd manually with ceph-volume (not cephadm ceph-volume) so you get a legacy osd (you'd see warnings about a stray daemon). If that works, adopt the osd with cephadm. I don't have a better idea right now.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
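For reference, the "plain ceph-volume, then adopt" path Eugen describes in that last message looks roughly like the sketch below. The device path /dev/sdX is a placeholder and osd.<id> stands for whatever id the new OSD receives; this is an outline of the sequence under those assumptions, not the exact commands used in the thread:

# purge the dead OSD from the cluster first (destroys its data and removes it from the crush tree)
ceph osd purge 88 --yes-i-really-mean-it

# on the OSD host: build a legacy (non-cephadm) OSD with plain ceph-volume
ceph-volume lvm create --bluestore --data /dev/sdX

# once the new osd.<id> is up, hand it over to cephadm as a managed daemon
cephadm adopt --style legacy --name osd.<id>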