Re: ceph orch status hangs forever

Hi,

In the end we found that running "sudo systemctl restart ceph.target" on
each Ceph node, one by one, while monitoring the health of the cluster with
"ceph status" in a separate terminal, was the solution. After restarting
everything, all commands are now working fine, and the associated OpenStack
services and VMs also came back to life.

Well, that was easy ;-) I thought you had already tried that, since your ceph status reported an uptime of only a week for a couple of daemons. Anyway, great that it's responsive again.
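
In case anyone finds this thread later, here is a minimal sketch of that
rolling restart, assuming SSH access to the nodes; the hostnames are only
examples, and you should wait for the cluster to look healthy in "ceph
status" before moving on to the next node:

  # hostnames below are examples, replace them with your own Ceph nodes
  for host in ceph-node-1 ceph-node-2 ceph-node-3; do
      ssh -n "$host" sudo systemctl restart ceph.target
      # keep "watch ceph -s" open in a second terminal and only continue
      # once the cluster reports HEALTH_OK (or all PGs are active+clean)
      read -r -p "Press enter when $host looks healthy in 'ceph status' ... "
  done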


Quoting Sebastian Luna Valero <sebastian.luna.valero@xxxxxxxxx>:

Hi Eugen,

Thank you very much for your help!

In the end we found that running "sudo systemctl restart ceph.target" on
each Ceph node, one by one, while monitoring the health of the cluster with
"ceph status" in a separate terminal, was the solution. After restarting
everything, all commands are now working fine, and the associated OpenStack
services and VMs also came back to life.

I hope this helps somebody else!

Best regards,
Sebastian

On Fri, 21 May 2021 at 11:19, Eugen Block <eblock@xxxxxx> wrote:

Hi,

> But we are not sure whether we should enable some of them. Right now none
> of the Ceph logs we have are showing errors. Would it help to enable some
> of those modules in order to see more logs?

I would not enable more modules; that could make it worse. Instead you
could try disabling the diskprediction_local module. But first I would
stop those hanging containers (docker/podman stop <ID>) on all
affected hosts, then maybe restart the mgr daemons one by one and see
if that helps.
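
In rough terms something like the following; the container IDs, the mgr
daemon names and the cluster fsid in the systemd unit name are placeholders
you would have to look up on your own hosts:

  # on each affected host: find and stop the hung "ceph ..." containers
  docker ps | grep 'ceph/ceph'     # or: podman ps
  docker stop <ID>                 # or: podman stop <ID>

  # restart the mgr daemons one at a time, on the host running each mgr;
  # with cephadm the unit name contains your cluster fsid and the daemon id
  sudo systemctl restart ceph-<fsid>@mgr.<hostname>.<id>.service

  # once a mgr responds again, the module could be disabled with
  ceph mgr module disable diskprediction_local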


Quoting ManuParra <mparra@xxxxxx>:

> Hi Eugen, this is the output: ceph mgr module ls
>
> {
>     "always_on_modules": [
>         "balancer",
>         "crash",
>         "devicehealth",
>         "orchestrator",
>         "pg_autoscaler",
>         "progress",
>         "rbd_support",
>         "status",
>         "telemetry",
>         "volumes"
>     ],
>     "enabled_modules": [
>         "cephadm",
>         "dashboard",
>         "diskprediction_local",
>         "iostat",
>         "prometheus",
>         "restful"
>     ],
>     "disabled_modules": [
> ...
> }
>
> As you can see, balancer/crash/… are in the always_on section. I checked
> this on all 3 monitor nodes, with the same output.
>
> Then, checking disabled_modules, I saw several modules that could
> help us gather more information (logs) about our problem, such as:
> - alerts
> - insights
> - test_orchestrator
> - and others…
>
> But we are not sure whether we should enable some of them. Right now none
> of the Ceph logs we have are showing errors. Would it help to enable some
> of those modules in order to see more logs?
>
> On the other hand, regarding the commands that hang: we can see that the
> containers launching them keep running, since they are waiting for the
> commands to finish. Here you can see the list of commands we tested that
> are still running (hung):
>
>
> af13bda77a1a   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph osd status"        23 hours ago   Up 23 hours   wizardly_leavitt
> 5b5c760454c7   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph telemetry stat…"   24 hours ago   Up 24 hours   intelligent_bardeen
> a98e6061489d   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph service dump"      24 hours ago   Up 24 hours   romantic_mendel
> 66c943a032f8   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph service status"    24 hours ago   Up 24 hours   happy_shannon
> 7e18899dffc5   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph crash stat"        24 hours ago   Up 24 hours   xenodochial_germain
> 8268082e753b   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph crash ls"          24 hours ago   Up 24 hours   stoic_volhard
> fc5c434a4e23   172.16.3.146:4000/ceph/ceph:v15.2.9   "ceph balancer status"   24 hours ago   Up 24 hours   epic_mendel
>
> So the containers will have to be removed.
>
> As for the logs of these containers, nothing appears inside the
> container (docker logs xxxx); only when you kill it do you see the
> following (--verbose):
> [ceph: root@spsrc-mon-1 /]# ceph --verbose pg stat
> ….
> validate_command: pg stat
> better match: 2.5 > 0: pg stat
> bestcmds_sorted:
> [{'flags': 8,
>   'help': 'show placement group status.',
>   'module': 'pg',
>   'perm': 'r',
>   'sig': [argdesc(<class 'ceph_argparse.CephPrefix'>, req=True,
> name=prefix, n=1, numseen=0, prefix=pg),
>           argdesc(<class 'ceph_argparse.CephPrefix'>, req=True,
> name=prefix, n=1, numseen=0, prefix=stat)]}]
> Submitting command:  {'prefix': 'pg stat', 'target': ('mon-mgr', '')}
> submit ['{"prefix": "pg stat", "target": ["mon-mgr", ""]}'] to mon-mgr
> [hung forever …]
>
> Kind regards,
> Manu.
>
>
>
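
Regarding the question about enabling additional modules to get more logs:
just for reference, the mgr module toggle would look like this (using
"alerts" from your disabled_modules list only as an example):

  ceph mgr module enable alerts     # turn a module on
  ceph mgr module ls                # check enabled/disabled state
  ceph mgr module disable alerts    # turn it off again

But as said above, I would rather not enable more modules while the mgr is
already in this state.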





_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



