Does running "ceph mgr fail" then waiting a bit make the "ceph orch" commands responsive? That's worked for me sometimes before when they wouldn't respond. On Thu, Sep 16, 2021 at 8:08 AM Javier Cacheiro <Javier.Cacheiro@xxxxxxxxx> wrote: > Hi, > > I have configured a ceph cluster with the new Pacific version (16.2.4) > using cephadm to see how it performed. > > Everything went smoothly and the cluster was working fine until I did a > ordered shutdown and reboot of the nodes and after that all "ceph orch" > commands hang as if they were not able to contact the cephadm orchestrator. > > I have seen other people experiencing a similar issue in a past thread > after a power outage that was resolved restarting the services at each > host. I have tried that but it did not work. > > As in that case, logs also give no clue and they show no errors, all > dockers are running fine except for rgw but I suspect those are irrelevant > for this case (of course I could be wrong). > > There are no data yet in the cluster, apart from tests, but I would really > like to find the cause of this issue but I am having a hard time figuring > out how "ceph orch" contacts the cephadm module of the ceph-mgr explore the > issue in more detail. Any ideas of how to proceed are well appreciated? > > Also it would be of great help if you could direct me at where to look at > the code or any details about how the command line tools contacts the > cephamd api in the ceph-mgr. > > Thanks a lot, > Javier > > Here are the details: > > # ceph orch status --verbose > ... > Submitting command: {'prefix': 'orch status', 'target': ('mon-mgr', '')} > submit {"prefix": "orch status", "target": ["mon-mgr", ""]} to mon-mgr > --> at this point hangs forever (strace shows it is blocked in futex lock) > > # ceph status > > cluster: > id: c6e89d30-de52-11eb-a76f-bc97e1e57d70 > health: HEALTH_WARN > 8 failed cephadm daemon(s) > 1 pools have many more objects per pg than average > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > flag(s) set > > services: > mon: 2 daemons, quorum c26-1,c28-1 (age 105m) > mgr: c26-1.sojetc(active, since 105m), standbys: c28-1.zmwxro, > c27-1.ixiiun, c28-38.lpsgmq, c26-40.ltomjc > osd: 192 osds: 192 up (since 110m), 192 in (since 10w) > flags > pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover > > data: > pools: 9 pools, 7785 pgs > objects: 24.69k objects, 166 GiB > usage: 24 TiB used, 2.7 PiB / 2.8 PiB avail > pgs: 7785 active+clean > > NOTE: Actually the cluster had 5 mons running, but in the last test I > started only two of them and I saw how the others appeared first as no > available first and then they were automatically removed from the config. > So even after later starting the other nodes they are no longer being used > as mons. Interestingly enouth they are still used as mgr. 
>
> # ceph health detail
> HEALTH_WARN 8 failed cephadm daemon(s); 1 pools have many more objects per
> pg than average;
> pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
> [WRN] CEPHADM_FAILED_DAEMON: 8 failed cephadm daemon(s)
>     daemon rgw.cesga.c27-35.qfbwai on c27-35 is in error state
>     daemon rgw.cesga.c27-35.eelnnx on c27-35 is in error state
>     daemon rgw.cesga.c27-35.mihttm on c27-35 is in error state
>     daemon rgw.cesga.c27-35.redbiq on c27-35 is in error state
>     daemon rgw.cesga.c27-36.igdmae on c27-36 is in error state
>     daemon rgw.cesga.c27-36.xrjhxh on c27-36 is in error state
>     daemon rgw.cesga.c27-36.rubmyu on c27-36 is in error state
>     daemon rgw.cesga.c27-36.swrygg on c27-36 is in error state
> [WRN] MANY_OBJECTS_PER_PG: 1 pools have many more objects per pg than average
>     pool glance-images objects per pg (36) is more than 12 times cluster average (3)
> [WRN] OSDMAP_FLAGS: pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
>
> # ceph mgr module ls
>     "always_on_modules": [
>         "balancer",
>         "crash",
>         "devicehealth",
>         "orchestrator",
>         "pg_autoscaler",
>         "progress",
>         "rbd_support",
>         "status",
>         "telemetry",
>         "volumes"
>     ],
>     "enabled_modules": [
>         "cephadm",
>         "dashboard",
>         "iostat",
>         "prometheus",
>         "restful"
>     ],
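In case it helps, this is roughly the sequence I mean. The mgr name is taken
from your "ceph status" output, so adjust it for whichever mgr is active at
the time (I believe a bare "ceph mgr fail" also fails over the active mgr on
Pacific):

    # ceph mgr fail c26-1.sojetc    <- fail over the current active mgr
    # ceph -s                       <- wait until a standby is reported active
    # ceph orch status              <- then retry the orch command

Failing the active mgr forces the mgr modules, including cephadm, to be
reloaded on the new active mgr, which is why it sometimes unsticks hung
"ceph orch" commands.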
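Regarding your question about how the command-line tool reaches the cephadm
API: "ceph orch ..." commands are ordinary mgr commands. The ceph CLI (a
Python program) builds a JSON command like the one you see in the --verbose
output and submits it to the active mgr over librados (that is what the
"mon-mgr" target means); the mgr then dispatches it to whichever module
registered the "orch" prefix (the orchestrator framework, backed here by
cephadm). As far as I know, the client side lives in src/ceph.in and
src/pybind/ceph_argparse.py, and the module side in src/pybind/mgr/orchestrator
and src/pybind/mgr/cephadm. A minimal sketch of the same call with the
python-rados binding, assuming the default ceph.conf and admin keyring paths
(adjust for your cluster), would be something like:

    #!/usr/bin/env python3
    # Rough sketch (not the actual CLI code) of submitting "orch status" to
    # the active mgr via python-rados. Conffile and keyring paths are
    # assumptions for illustration.
    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf',
                          conf={'keyring': '/etc/ceph/ceph.client.admin.keyring'})
    cluster.connect()

    cmd = json.dumps({'prefix': 'orch status'})
    # mgr_command() targets the active mgr, which hands the command to the
    # orchestrator/cephadm module.
    ret, outbuf, outs = cluster.mgr_command(cmd, b'')
    print('ret:', ret)
    print(outbuf.decode() if outbuf else outs)

    cluster.shutdown()

If a sketch like that hangs the same way the CLI does, the problem is on the
mgr side (the cephadm module not answering) rather than in the client, which
again points toward trying "ceph mgr fail".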
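One more thing that might give you more to go on: as far as I remember from
the cephadm troubleshooting docs, you can raise the cephadm module's log
level and watch its log stream from any client node:

    # ceph config set mgr mgr/cephadm/log_to_cluster_level debug
    # ceph -W cephadm --watch-debug

If the module is wedged you may see nothing at all there even after failing
the mgr, which in itself is a useful data point.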