Re: "ceph orch restart mgr" command creates mgr restart loop

Hi Chris

Having also recently started exploring Ceph, I ran into this problem too.
I found that terminating the command with ctrl-c seemed to stop the looping, which, by the way, also happens on all the other mgr instances in the cluster, not just the active one.
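
In case it's useful, here is roughly how I kept an eye on whether the restarts had actually stopped after the ctrl-c; nothing clever, just polling the mgr line that "ceph -s" already prints (the 10-second interval is arbitrary):

# Watch the active mgr's "since Ns" uptime; while the restart loop is running
# it resets every few seconds, and once it climbs steadily the loop has stopped.
while true; do ceph -s | grep 'mgr:'; sleep 10; done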

Regards

Jens

-----Original Message-----
From: Chris Read <chris.read@xxxxxxxxx> 
Sent: 11 January 2021 21:54
To: ceph-users@xxxxxxx
Subject:  "ceph orch restart mgr" command creates mgr restart loop

Greetings all...

I'm busy testing out Ceph and have hit this troublesome bug while following the steps outlined here:

https://docs.ceph.com/en/octopus/cephadm/monitoring/#configuring-ssl-tls-for-grafana

When I issue the "ceph orch restart mgr" command, it appears the command is not cleared from a message queue somewhere (I'm still very unclear on many Ceph specifics), so each time the mgr process comes back up after the restart it picks up the message again and restarts itself over and over (so far it's been stuck in this state for 45 minutes).
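
For anyone trying to reproduce this, the sequence was roughly the following. I'm quoting the grafana steps from memory of the linked page, so treat the exact config-key names as approximate, and key.pem / certificate.pem are just placeholder paths:

# TLS material for grafana, per the cephadm monitoring docs (key names approximate)
ceph config-key set mgr/cephadm/grafana_key -i key.pem
ceph config-key set mgr/cephadm/grafana_crt -i certificate.pem
# the restart that then never seems to clear
ceph orch restart mgr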

Watching the logs we see this going on:

$ ceph log last cephadm -w

root@ceph-poc-000:~# ceph log last cephadm -w
  cluster:
    id:     d23bc326-543a-11eb-bfe0-b324db228b6c
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum ceph-poc-000,ceph-poc-003,ceph-poc-004,ceph-poc-002,ceph-poc-001 (age 2h)
    mgr: ceph-poc-000.himivo(active, since 4s), standbys: ceph-poc-001.unjulx
    osd: 10 osds: 10 up (since 2h), 10 in (since 2h)

  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   10 GiB used, 5.4 TiB / 5.5 TiB avail
    pgs:     1 active+clean


2021-01-11T20:46:32.976606+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:32.980749+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:46:33.061519+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:46:39.156420+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:39.160618+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:46:39.242603+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:46:45.299953+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:45.304006+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:46:45.733495+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:46:51.871903+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:51.877107+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:46:51.976190+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:46:58.000720+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:46:58.006843+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:46:58.097163+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:47:04.188630+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:04.193501+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:47:04.285509+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available
2021-01-11T20:47:10.348099+0000 mon.ceph-poc-000 [INF] Active manager daemon ceph-poc-000.himivo restarted
2021-01-11T20:47:10.352340+0000 mon.ceph-poc-000 [INF] Activating manager daemon ceph-poc-000.himivo
2021-01-11T20:47:10.752243+0000 mon.ceph-poc-000 [INF] Manager daemon ceph-poc-000.himivo is now available

And in the logs for the mgr instance itself we see it keep replaying the message over and over:

$ docker logs -f ceph-d23bc326-543a-11eb-bfe0-b324db228b6c-mgr.ceph-poc-000.himivo
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 set uid:gid to 167:167 (ceph:ceph)
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process ceph-mgr, pid 1
debug 2021-01-11T20:47:31.390+0000 7f48b0d0d200  0 pidfile_write: ignore empty --pid-file
debug 2021-01-11T20:47:31.414+0000 7f48b0d0d200  1 mgr[py] Loading python module 'alerts'
debug 2021-01-11T20:47:31.486+0000 7f48b0d0d200  1 mgr[py] Loading python module 'balancer'
debug 2021-01-11T20:47:31.542+0000 7f48b0d0d200  1 mgr[py] Loading python module 'cephadm'
debug 2021-01-11T20:47:31.742+0000 7f48b0d0d200  1 mgr[py] Loading python module 'crash'
debug 2021-01-11T20:47:31.798+0000 7f48b0d0d200  1 mgr[py] Loading python module 'dashboard'
debug 2021-01-11T20:47:32.258+0000 7f48b0d0d200  1 mgr[py] Loading python module 'devicehealth'
debug 2021-01-11T20:47:32.306+0000 7f48b0d0d200  1 mgr[py] Loading python module 'diskprediction_local'
debug 2021-01-11T20:47:32.498+0000 7f48b0d0d200  1 mgr[py] Loading python module 'influx'
debug 2021-01-11T20:47:32.550+0000 7f48b0d0d200  1 mgr[py] Loading python module 'insights'
debug 2021-01-11T20:47:32.598+0000 7f48b0d0d200  1 mgr[py] Loading python module 'iostat'
debug 2021-01-11T20:47:32.642+0000 7f48b0d0d200  1 mgr[py] Loading python module 'k8sevents'
debug 2021-01-11T20:47:33.034+0000 7f48b0d0d200  1 mgr[py] Loading python module 'localpool'
debug 2021-01-11T20:47:33.082+0000 7f48b0d0d200  1 mgr[py] Loading python module 'orchestrator'
debug 2021-01-11T20:47:33.266+0000 7f48b0d0d200  1 mgr[py] Loading python module 'osd_support'
debug 2021-01-11T20:47:33.310+0000 7f48b0d0d200  1 mgr[py] Loading python module 'pg_autoscaler'
debug 2021-01-11T20:47:33.382+0000 7f48b0d0d200  1 mgr[py] Loading python module 'progress'
debug 2021-01-11T20:47:33.442+0000 7f48b0d0d200  1 mgr[py] Loading python module 'prometheus'
debug 2021-01-11T20:47:33.794+0000 7f48b0d0d200  1 mgr[py] Loading python module 'rbd_support'
debug 2021-01-11T20:47:33.870+0000 7f48b0d0d200  1 mgr[py] Loading python module 'restful'
debug 2021-01-11T20:47:34.086+0000 7f48b0d0d200  1 mgr[py] Loading python module 'rook'
debug 2021-01-11T20:47:34.606+0000 7f48b0d0d200  1 mgr[py] Loading python module 'selftest'
debug 2021-01-11T20:47:34.654+0000 7f48b0d0d200  1 mgr[py] Loading python module 'status'
debug 2021-01-11T20:47:34.710+0000 7f48b0d0d200  1 mgr[py] Loading python module 'telegraf'
debug 2021-01-11T20:47:34.762+0000 7f48b0d0d200  1 mgr[py] Loading python module 'telemetry'
debug 2021-01-11T20:47:35.034+0000 7f48b0d0d200  1 mgr[py] Loading python module 'test_orchestrator'
debug 2021-01-11T20:47:35.146+0000 7f48b0d0d200  1 mgr[py] Loading python module 'volumes'
debug 2021-01-11T20:47:35.294+0000 7f48b0d0d200  1 mgr[py] Loading python module 'zabbix'
debug 2021-01-11T20:47:35.366+0000 7f489df86700  0 ms_deliver_dispatch: unhandled message 0x55d57b71a420 mon_map magic: 0 v1 from mon.0 v2:10.128.18.16:3300/0
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map Activating!
debug 2021-01-11T20:47:35.758+0000 7f489df86700  1 mgr handle_mgr_map I am now activating
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  0 [balancer DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.770+0000 7f484ca14700  1 mgr load Constructed class from module: balancer
debug 2021-01-11T20:47:35.774+0000 7f484ca14700  0 [cephadm DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed class from module: cephadm
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  0 [crash DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.874+0000 7f484ca14700  1 mgr load Constructed class from module: crash
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [dashboard DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: dashboard
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [devicehealth DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: devicehealth
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  0 [iostat DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.878+0000 7f484ca14700  1 mgr load Constructed class from module: iostat
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  0 [orchestrator DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.886+0000 7f484ca14700  1 mgr load Constructed class from module: orchestrator
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [pg_autoscaler DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed class from module: pg_autoscaler
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  0 [progress DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.890+0000 7f484ca14700  1 mgr load Constructed class from module: progress
debug 2021-01-11T20:47:35.898+0000 7f484ca14700  0 [prometheus DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.910+0000 7f484ca14700  1 mgr load Constructed class from module: prometheus
debug 2021-01-11T20:47:35.914+0000 7f484ca14700  0 [rbd_support DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
[11/Jan/2021:20:47:35] ENGINE Bus STARTING
Warning: Permanently added 'ceph-poc-000,10.128.18.16' (ECDSA) to the list of known hosts.
Warning: Permanently added 'ceph-poc-002,10.128.19.248' (ECDSA) to the list of known hosts.
debug 2021-01-11T20:47:35.974+0000 7f484ca14700  1 mgr load Constructed class from module: rbd_support
Warning: Permanently added 'ceph-poc-001,10.128.28.223' (ECDSA) to the list of known hosts.
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [restful DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: restful
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [status DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: status
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [telemetry DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  1 mgr load Constructed class from module: telemetry
debug 2021-01-11T20:47:35.982+0000 7f486f3cc700  0 [restful WARNING root] server not running: no certificate configured
debug 2021-01-11T20:47:35.982+0000 7f484ca14700  0 [volumes DEBUG root] setting log level based on debug_mgr: WARNING (1/5)
Warning: Permanently added 'ceph-poc-004,10.128.31.60' (ECDSA) to the list of known hosts.
debug 2021-01-11T20:47:35.994+0000 7f484ca14700  1 mgr load Constructed class from module: volumes
Warning: Permanently added 'ceph-poc-003,10.128.30.127' (ECDSA) to the list of known hosts.
CherryPy Checker:
The Application mounted at '' has an empty config.

[11/Jan/2021:20:47:36] ENGINE Serving on http://:::9283
[11/Jan/2021:20:47:36] ENGINE Bus STARTED
debug 2021-01-11T20:47:36.778+0000 7f484d215700  0 log_channel(audit) log [DBG] : from='client.19140 -' entity='client.admin' cmd=[{"prefix": "orch", "action": "restart", "service_name": "mgr", "target": ["mon-mgr", ""]}]: dispatch
debug 2021-01-11T20:47:36.778+0000 7f484ca14700  0 log_channel(cephadm) log [INF] : Restart service mgr
debug 2021-01-11T20:47:36.806+0000 7f484d215700  0 log_channel(cluster) log [DBG] : pgmap v3: 1 pgs: 1 active+clean; 0 B data, 208 MiB used, 5.4 TiB / 5.5 TiB avail
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 received  signal: Terminated from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
debug 2021-01-11T20:47:37.406+0000 7f489ef88700 -1 mgr handle_mgr_signal *** Got signal Terminated ***


Is there any way to manually clear this stuck message? Will it eventually time out by itself?

Either way, it's a pretty unpleasant surprise to find when following the docs so early in the process.
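
If there is some queue or pending-command state that can be inspected, my guess (and it is purely a guess) is that it lives with the rest of the cephadm module's state in the config-key store; failing the active mgr over to the standby might be another way to shake it loose:

# purely speculative, not a confirmed fix
ceph config-key ls | grep mgr/cephadm     # assuming the cephadm module keeps its state under this prefix
ceph mgr fail ceph-poc-000.himivo         # force a failover to the standby mgr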

Thanks,

Chris
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


