Does running "ceph mgr fail" then waiting a bit make the "ceph orch" commands responsive? That's worked for me sometimes before when they wouldn't respond. On Thu, Sep 16, 2021 at 8:08 AM Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx> wrote: > Hi, > > I have configured a ceph cluster with the new Pacific version (16.2.4) > using cephadm to see how it performed. > > Everything went smoothly and the cluster was working fine until I did a > ordered shutdown and reboot of the nodes and after that all "ceph orch" > commands hang as if they were not able to contact the cephadm orchestrator. > > I have seen other people experiencing a similar issue in a past thread > after a power outage that was resolved restarting the services at each > host. I have tried that but it did not work. > > As in that case, logs also give no clue and they show no errors, all > dockers are running fine except for rgw but I suspect those are irrelevant > for this case (of course I could be wrong). > > There are no data yet in the cluster, apart from tests, but I would really > like to find the cause of this issue but I am having a hard time figuring > out how "ceph orch" contacts the cephadm module of the ceph-mgr explore the > issue in more detail. Any ideas of how to proceed are well appreciated? > > Also it would be of great help if you could direct me at where to look at > the code or any details about how the command line tools contacts the > cephamd api in the ceph-mgr. > > Thanks a lot, > Javier > > Here are the details: > > # ceph orch status --verbose > ... > Submitting command: {'prefix': 'orch status', 'target': ('mon-mgr', '')} > submit {"prefix": "orch status", "target": ["mon-mgr", ""]} to mon-mgr > --> at this point hangs forever (strace shows it is blocked in futex lock) > > # ceph status > > cluster: > id: c6e89d30-de52-11eb-a76f-bc97e1e57d70 > health: HEALTH_WARN > 8 failed cephadm daemon(s) > 1 pools have many more objects per pg than average > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > flag(s) set > > services: > mon: 2 daemons, quorum c26-1,c28-1 (age 105m) > mgr: c26-1.sojetc(active, since 105m), standbys: c28-1.zmwxro, > c27-1.ixiiun, c28-38.lpsgmq, c26-40.ltomjc > osd: 192 osds: 192 up (since 110m), 192 in (since 10w) > flags > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > > data: > pools: 9 pools, 7785 pgs > objects: 24.69k objects, 166 GiB > usage: 24 TiB used, 2.7 PiB / 2.8 PiB avail > pgs: 7785 active+clean > > NOTE: Actually the cluster had 5 mons running, but in the last test I > started only two of them and I saw how the others appeared first as no > available first and then they were automatically removed from the config. > So even after later starting the other nodes they are no longer being used > as mons. Interestingly enouth they are still used as mgr. 
>
> # ceph health detail
> HEALTH_WARN 8 failed cephadm daemon(s); 1 pools have many more objects per
> pg than average;
> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
> [WRN] CEPHADM_FAILED_DAEMON: 8 failed cephadm daemon(s)
>     daemon rgw.cesga.c27-35.qfbwai on c27-35 is in error state
>     daemon rgw.cesga.c27-35.eelnnx on c27-35 is in error state
>     daemon rgw.cesga.c27-35.mihttm on c27-35 is in error state
>     daemon rgw.cesga.c27-35.redbiq on c27-35 is in error state
>     daemon rgw.cesga.c27-36.igdmae on c27-36 is in error state
>     daemon rgw.cesga.c27-36.xrjhxh on c27-36 is in error state
>     daemon rgw.cesga.c27-36.rubmyu on c27-36 is in error state
>     daemon rgw.cesga.c27-36.swrygg on c27-36 is in error state
> [WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
>     pool glance-images objects per pg (36) is more than 12 times cluster average (3)
> [WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>
> # ceph mgr module ls
>     "always_on_modules": [
>         "balancer",
>         "crash",
>         "devicehealth",
>         "orchestrator",
>         "pg_autoscaler",
>         "progress",
>         "rbd_support",
>         "status",
>         "telemetry",
>         "volumes"
>     ],
>     "enabled_modules": [
>         "cephadm",
>         "dashboard",
>         "iostat",
>         "prometheus",
>         "restful"
>     ],
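In case it helps, this is roughly the sequence I mean. The mgr name is taken
from your "ceph status" output, so adjust it for whichever mgr is active at
the time (I believe a bare "ceph mgr fail" also fails over the active mgr on
Pacific):

    # ceph mgr fail c26-1.sojetc    <- fail over the current active mgr
    # ceph -s                       <- wait until a standby is reported active
    # ceph orch status              <- then retry the orch command

Failing the active mgr forces the mgr modules, including cephadm, to be
reloaded on the new active mgr, which is why it sometimes unsticks hung
"ceph orch" commands.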
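Regarding your question about how the command-line tool reaches the cephadm
API: "ceph orch ..." commands are ordinary mgr commands. The ceph CLI (a
Python program) builds a JSON command like the one you see in the --verbose
output and submits it to the active mgr over librados (that is what the
"mon-mgr" target means); the mgr then dispatches it to whichever module
registered the "orch" prefix (the orchestrator framework, backed here by
cephadm). As far as I know, the client side lives in src/ceph.in and
src/pybind/ceph_argparse.py, and the module side in src/pybind/mgr/orchestrator
and src/pybind/mgr/cephadm. A minimal sketch of the same call with the
python-rados binding, assuming the default ceph.conf and admin keyring paths
(adjust for your cluster), would be something like:

    #!/usr/bin/env python3
    # Rough sketch (not the actual CLI code) of submitting "orch status" to
    # the active mgr via python-rados. Conffile and keyring paths are
    # assumptions for illustration.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf={'keyring': '/etc/ceph/ceph.client.admin.keyring'})
    cluster.connect()

    cmd = json.dumps({'prefix': 'orch status'})
    # mgr_command() targets the active mgr, which hands the command to the
    # orchestrator/cephadm module.
    ret, outbuf, outs = cluster.mgr_command(cmd, b'')
    print('ret:', ret)
    print(outbuf.decode() if outbuf else outs)

    cluster.shutdown()

If a sketch like that hangs the same way the CLI does, the problem is on the
mgr side (the cephadm module not answering) rather than in the client, which
again points toward trying "ceph mgr fail".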
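One more thing that might give you more to go on: as far as I remember from
the cephadm troubleshooting docs, you can raise the cephadm module's log
level and watch its log stream from any client node:

    # ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    # ceph -W cephadm --watch-debug

If the module is wedged you may see nothing at all there even after failing
the mgr, which in itself is a useful data point.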