Hi Adam,

Thanks a lot for your answer. I have tried "ceph mgr fail" and the active
manager migrated to a different node, but the "ceph orch" commands continue
to hang.

# ceph orch status --verbose
...
Submitting command: {'prefix': 'orch status', 'target': ('mon-mgr', '')}
submit {"prefix": "orch status", "target": ["mon-mgr", ""]} to mon-mgr

I don't know the messaging system it uses to communicate with the target
mon-mgr, but it seems the message never gets a response, so it makes sense
to think that something is blocked in the mgr. I just do not know how to
check the mgr internals.
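If I understand it correctly, the CLI serializes the command to JSON (the
payload shown by --verbose above) and hands it to librados, which forwards
it to the active mgr, where the cephadm/orchestrator module serves the
"orch" prefixes. Assuming that is right, I think the hang could be
reproduced outside the CLI with a minimal, untested python-rados sketch
like the one below (the conffile/keyring paths and the 'format' field are
just my assumptions):

import json
import rados

# Connect with the local admin credentials (paths are assumptions).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                      conf=dict(keyring='/etc/ceph/ceph.client.admin.keyring'))
cluster.connect()

# Same JSON payload the CLI builds; it should be delivered to the active
# mgr, where the cephadm module handles the "orch status" prefix.
cmd = json.dumps({'prefix': 'orch status', 'format': 'json'})
ret, outbuf, outs = cluster.mgr_command(cmd, b'')
print(ret, outs)
print(outbuf.decode('utf-8', errors='replace'))

cluster.shutdown()

If this call also blocks forever, that would point at the cephadm module
inside the mgr rather than at the CLI itself. I will also try raising the
cephadm log level (if I read the docs right: "ceph config set mgr
mgr/cephadm/log_to_cluster_level debug" and then "ceph -W cephadm
--watch-debug") to see whether the module logs anything when the command
arrives.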
On Thu, 16 Sept 2021 at 14:53, Adam King <adking@xxxxxxxxxx> wrote:

> Does running "ceph mgr fail" then waiting a bit make the "ceph orch"
> commands responsive? That's worked for me sometimes before when they
> wouldn't respond.
>
> On Thu, Sep 16, 2021 at 8:08 AM Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx>
> wrote:
>
>> Hi,
>>
>> I have configured a ceph cluster with the new Pacific version (16.2.4)
>> using cephadm to see how it performed.
>>
>> Everything went smoothly and the cluster was working fine until I did an
>> ordered shutdown and reboot of the nodes. After that, all "ceph orch"
>> commands hang as if they were not able to contact the cephadm
>> orchestrator.
>>
>> I have seen other people experiencing a similar issue in a past thread
>> after a power outage; it was resolved by restarting the services on each
>> host. I have tried that, but it did not work.
>>
>> As in that case, the logs also give no clue and show no errors. All the
>> containers are running fine except for the rgw ones, but I suspect those
>> are irrelevant for this case (of course I could be wrong).
>>
>> There is no data in the cluster yet, apart from tests, but I would really
>> like to find the cause of this issue. I am having a hard time figuring
>> out how "ceph orch" contacts the cephadm module of the ceph-mgr in order
>> to explore the issue in more detail. Any ideas on how to proceed are much
>> appreciated.
>>
>> It would also be of great help if you could point me to where to look in
>> the code, or to any details about how the command-line tool contacts the
>> cephadm API in the ceph-mgr.
>>
>> Thanks a lot,
>> Javier
>>
>> Here are the details:
>>
>> # ceph orch status --verbose
>> ...
>> Submitting command: {'prefix': 'orch status', 'target': ('mon-mgr', '')}
>> submit {"prefix": "orch status", "target": ["mon-mgr", ""]} to mon-mgr
>> --> at this point it hangs forever (strace shows it is blocked on a futex lock)
>>
>> # ceph status
>>
>>   cluster:
>>     id:     c6e89d30-de52-11eb-a76f-bc97e1e57d70
>>     health: HEALTH_WARN
>>             8 failed cephadm daemon(s)
>>             1 pools have many more objects per pg than average
>>             pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>>
>>   services:
>>     mon: 2 daemons, quorum c26-1,c28-1 (age 105m)
>>     mgr: c26-1.sojetc(active, since 105m), standbys: c28-1.zmwxro,
>>          c27-1.ixiiun, c28-38.lpsgmq, c26-40.ltomjc
>>     osd: 192 osds: 192 up (since 110m), 192 in (since 10w)
>>          flags pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover
>>
>>   data:
>>     pools:   9 pools, 7785 pgs
>>     objects: 24.69k objects, 166 GiB
>>     usage:   24 TiB used, 2.7 PiB / 2.8 PiB avail
>>     pgs:     7785 active+clean
>>
>> NOTE: The cluster actually had 5 mons running, but in the last test I
>> started only two of them, and I saw the others first appear as not
>> available and then get automatically removed from the config. So even
>> after later starting the other nodes, they are no longer being used as
>> mons. Interestingly enough, they are still used as mgrs.
>>
>> # ceph health detail
>> HEALTH_WARN 8 failed cephadm daemon(s); 1 pools have many more objects
>> per pg than average;
>> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>> [WRN] CEPHADM_FAILED_DAEMON: 8 failed cephadm daemon(s)
>>     daemon rgw.cesga.c27-35.qfbwai on c27-35 is in error state
>>     daemon rgw.cesga.c27-35.eelnnx on c27-35 is in error state
>>     daemon rgw.cesga.c27-35.mihttm on c27-35 is in error state
>>     daemon rgw.cesga.c27-35.redbiq on c27-35 is in error state
>>     daemon rgw.cesga.c27-36.igdmae on c27-36 is in error state
>>     daemon rgw.cesga.c27-36.xrjhxh on c27-36 is in error state
>>     daemon rgw.cesga.c27-36.rubmyu on c27-36 is in error state
>>     daemon rgw.cesga.c27-36.swrygg on c27-36 is in error state
>> [WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
>>     pool glance-images objects per pg (36) is more than 12 times cluster average (3)
>> [WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>>
>> # ceph mgr module ls
>> "always_on_modules": [
>>     "balancer",
>>     "crash",
>>     "devicehealth",
>>     "orchestrator",
>>     "pg_autoscaler",
>>     "progress",
>>     "rbd_support",
>>     "status",
>>     "telemetry",
>>     "volumes"
>> ],
>> "enabled_modules": [
>>     "cephadm",
>>     "dashboard",
>>     "iostat",
>>     "prometheus",
>>     "restful"
>> ],
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx