Hi,

I have configured a Ceph cluster with the new Pacific version (16.2.4) using cephadm to see how it performs. Everything went smoothly and the cluster was working fine until I did an ordered shutdown and reboot of the nodes; after that, all "ceph orch" commands hang as if they were not able to contact the cephadm orchestrator.

I have seen other people report a similar issue in a past thread after a power outage; in that case it was resolved by restarting the services on each host. I have tried that, but it did not work. As in that case, the logs give no clue and show no errors, and all the docker containers are running fine except the rgw ones, which I suspect are irrelevant here (of course I could be wrong).

There is no data in the cluster yet, apart from tests, but I would really like to find the cause of this issue. However, I am having a hard time figuring out how "ceph orch" contacts the cephadm module of the ceph-mgr, so I cannot explore the issue in more detail (see the sketch at the end of this mail for how I currently understand the submission path). Any ideas on how to proceed are much appreciated. It would also be of great help if you could point me to the relevant code, or to any details about how the command-line tool contacts the cephadm API in the ceph-mgr.

Thanks a lot,
Javier

Here are the details:

# ceph orch status --verbose
...
Submitting command: {'prefix': 'orch status', 'target': ('mon-mgr', '')}
submit {"prefix": "orch status", "target": ["mon-mgr", ""]} to mon-mgr
--> at this point it hangs forever (strace shows it is blocked on a futex lock)

# ceph status
  cluster:
    id:     c6e89d30-de52-11eb-a76f-bc97e1e57d70
    health: HEALTH_WARN
            8 failed cephadm daemon(s)
            1 pools have many more objects per pg than average
            pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

  services:
    mon: 2 daemons, quorum c26-1,c28-1 (age 105m)
    mgr: c26-1.sojetc(active, since 105m), standbys: c28-1.zmwxro, c27-1.ixiiun, c28-38.lpsgmq, c26-40.ltomjc
    osd: 192 osds: 192 up (since 110m), 192 in (since 10w)
         flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover

  data:
    pools:   9 pools, 7785 pgs
    objects: 24.69k objects, 166 GiB
    usage:   24 TiB used, 2.7 PiB / 2.8 PiB avail
    pgs:     7785 active+clean

NOTE: The cluster actually had 5 mons running, but in the last test I started only two of them and saw how the others first appeared as not available and were then automatically removed from the config. So even after later starting the other nodes, they are no longer being used as mons. Interestingly enough, they are still being used as mgrs.
# ceph health detail
HEALTH_WARN 8 failed cephadm daemon(s); 1 pools have many more objects per pg than average; pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
[WRN] CEPHADM_FAILED_DAEMON: 8 failed cephadm daemon(s)
    daemon rgw.cesga.c27-35.qfbwai on c27-35 is in error state
    daemon rgw.cesga.c27-35.eelnnx on c27-35 is in error state
    daemon rgw.cesga.c27-35.mihttm on c27-35 is in error state
    daemon rgw.cesga.c27-35.redbiq on c27-35 is in error state
    daemon rgw.cesga.c27-36.igdmae on c27-36 is in error state
    daemon rgw.cesga.c27-36.xrjhxh on c27-36 is in error state
    daemon rgw.cesga.c27-36.rubmyu on c27-36 is in error state
    daemon rgw.cesga.c27-36.swrygg on c27-36 is in error state
[WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
    pool glance-images objects per pg (36) is more than 12 times cluster average (3)
[WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set

# ceph mgr module ls
    "always_on_modules": [
        "balancer",
        "crash",
        "devicehealth",
        "orchestrator",
        "pg_autoscaler",
        "progress",
        "rbd_support",
        "status",
        "telemetry",
        "volumes"
    ],
    "enabled_modules": [
        "cephadm",
        "dashboard",
        "iostat",
        "prometheus",
        "restful"
    ],
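
P.S. For reference, here is how I currently understand the path a command like "ceph orch status" takes, boiled down to a minimal python-rados sketch. This is only my reading, not the actual ceph code: the CLI builds a JSON command targeted at "mon-mgr" and submits it to the active mgr via librados, and the mgr then dispatches it to the module that registered the "orch" prefix (the orchestrator module, backed by cephadm). If I am reading the source tree correctly, the CLI side lives in src/ceph.in and src/pybind/ceph_argparse.py, and the mgr side under src/pybind/mgr/orchestrator/ and src/pybind/mgr/cephadm/.

#!/usr/bin/env python3
# Minimal sketch (my understanding, not the real CLI code): submit the same
# command that "ceph orch status" sends, directly through the python-rados
# bindings. Assumes the default ceph.conf and an admin keyring on this host.
import json

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf', name='client.admin')
cluster.connect()
try:
    cmd = json.dumps({'prefix': 'orch status', 'format': 'json'})
    # mgr_command() sends the JSON command to the active mgr, which hands it
    # to the module that registered the "orch" prefix.
    ret, outbuf, outs = cluster.mgr_command(cmd, b'')
    print('ret  =', ret)
    print('outs =', outs)
    print(outbuf.decode('utf-8', errors='replace'))
finally:
    cluster.shutdown()

I am hoping that running something like this directly will make it easier to tell whether the request ever reaches the active mgr or whether the client blocks before that.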