Re: Ceph Pacific mon is not starting after host reboot

Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> · Sun, 23 May 2021 19:32:09 +0300

It's a fresh Pacific install with the default settings on all hosts :

root@node01:/home/adrian# ceph config show-with-defaults mon.node03 | grep msgr
mon_warn_on_msgr2_not_enabled true default
ms_bind_msgr1 true default
ms_bind_msgr2 true

On 5/23/2021 5:50 PM, Szabo, Istvan (Agoda) wrote:
Not sure it's the issue, but it complains about msgr, not msgr2. Do you 
have the v1 and v2 addresses in the ceph.conf on those specific OSDs?

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2021. May 23., at 15:40, Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> wrote:

Hi guys,

I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will 
put it in production on a 1PB+ storage cluster with rgw-only access.

I noticed a weird issue with my mons :

- if I reboot a mon host, the ceph-mon container does not start 
after the reboot

- I can see with 'ceph orch ps' the following output :

mon.node01  node01  running (20h)   4m ago   20h   16.2.4     8d91d370c2b8  0a2e86af94b2
mon.node02  node02  running (115m)  12s ago  115m  16.2.4     8d91d370c2b8  51f4885a1b06
mon.node03  node03  stopped         4m ago   19h   <unknown>  <unknown>     <unknown>

(where node03 is the host which was rebooted).
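
To spot this state without reading the table by eye, the listing can be filtered for daemons reported as stopped. This is only a minimal sketch and not part of the original report: `stopped_daemons` is a hypothetical helper name, and it assumes the daemon name is the first column and the state shows up as the standalone word `stopped`, as in the output above. The column layout may differ in other releases.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph orch ps` output on stdin and print the
# names of daemons whose row contains the standalone word "stopped".
# Assumes the daemon name is the first whitespace-separated field.
stopped_daemons() {
    awk '{
        for (i = 2; i <= NF; i++) {
            if ($i == "stopped") { print $1; next }
        }
    }'
}

# Example (run on a node with an admin keyring):
#   ceph orch ps | stopped_daemons
```

The helper only reads text from stdin, so it can also be tried against a saved copy of the listing.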

- I tried to start the mon container manually on node03 with 
'/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' 
and I got the following output :

debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0 mon.node03@-1(???).osd e408 crush map has features 3314933069573799936, adjusting msgr requires
debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
cluster 2021-05-23T08:07:12.189243+0000 mgr.node01.ksitls (mgr.14164) 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
debug 2021-05-23T08:24:25.196+0000 7f9a9e358700  1 mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
debug 2021-05-23T08:24:25.208+0000 7f9a88176700  1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.000000000s
debug 2021-05-23T08:24:25.208+0000 7f9a9e358700  0 mon.node03@-1(probing) e5  my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+0000 7f9a87975700  0 mon.node03@1(probing) e6  removed from monmap, suicide.
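
The last log line says the daemon no longer found itself in the monmap (epoch e6) and shut down. One way to confirm this from a healthy node is to look for the mon's name in the current monmap. A minimal sketch, with assumptions: `mon_in_monmap` is a hypothetical helper name, and it assumes `ceph mon dump` prints one line per monitor containing `mon.<name>` as a separate word, which should be checked against the actual output on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `ceph mon dump` output on stdin and succeed
# (exit status 0) only if a field exactly equal to "mon.<name>" is present.
# Usage: ceph mon dump | mon_in_monmap node03
mon_in_monmap() {
    awk -v want="mon.$1" '
        { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }
    '
}

# Example (run on a node with an admin keyring):
#   ceph mon dump | mon_in_monmap node03 || echo "node03 is not in the monmap"
```

Comparing whole fields rather than substrings keeps a name like node0 from matching node03.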

root@node03:/home/adrian# systemctl status ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
     Loaded: loaded (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
    Process: 1176 ExecStart=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run (code=exited, status=0/SUCCESS)
    Process: 1855 ExecStop=/usr/bin/docker stop ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, status=1/FAILURE)
    Process: 1861 ExecStopPost=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop (code=exited, status=0/SUCCESS)
   Main PID: 1176 (code=exited, status=0/SUCCESS)

The only fix I could find was to redeploy the mon with :

ceph orch daemon rm  mon.node03 --force
ceph orch daemon add mon node03

However, even though it works after the redeploy, an issue like this 
doesn't give me much confidence to use it in a production environment. 
I could reproduce it with 2 different mons, so it's not a one-off.

My setup is based on Ubuntu 20.04 and docker instead of podman :

root@node01:~# docker -v
Docker version 20.10.6, build 370c289

Do you know a workaround for this issue, or is this a known bug? I 
noticed that there are some other complaints about the same behaviour 
in Octopus as well, and the solution at that time was to delete the 
/var/lib/ceph/mon folder.

Thanks.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
