Re: Ceph Pacific mon is not starting after host reboot

David Orman <ormandj@xxxxxxxxxxxx> · Thu, 12 Aug 2021 08:50:17 -0500

https://github.com/ceph/ceph/pull/42690 looks like it might be a fix,
but it's pending review.

On Thu, Aug 12, 2021 at 7:46 AM André Gemünd
<andre.gemuend@xxxxxxxxxxxxxxxxxx> wrote:
>
> We're seeing the same here with v16.2.5 on CentOS 8.3
>
> Do you know of any progress?
>
> Best Greetings
> André
>
> ----- On 9 Aug 2021 at 18:15, David Orman ormandj@xxxxxxxxxxxx wrote:
>
> > Hi,
> >
> > We are seeing very similar behavior on 16.2.5, and also have noticed
> > that an undeploy/deploy cycle fixes things. Before we go rummaging
> > through the source code trying to determine the root cause, has
> > anybody else figured this out? It seems odd that a repeatable issue
> > (I've seen other mailing list posts about this same issue) impacting
> > 16.2.4/16.2.5, at least, on reboots hasn't been addressed yet, so
> > I wanted to check.
> >
> > Here's one of the other thread titles that appears related:
> > "mons assigned via orch label 'committing suicide' upon
> > reboot."
> >
> > Respectfully,
> > David
> >
> >
> > On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
> > <adrian.nicolae@xxxxxxxxxx> wrote:
> >>
> >> Hi guys,
> >>
> >> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
> >> it in production on a 1PB+ storage cluster with rgw-only access.
> >>
> >> I noticed a weird issue with my mons:
> >>
> >> - If I reboot a mon host, the ceph-mon container does not start after
> >> the reboot.
> >>
> >> - 'ceph orch ps' shows the following output:
> >>
> >> mon.node01  node01  running (20h)   4m ago   20h   16.2.4     8d91d370c2b8  0a2e86af94b2
> >> mon.node02  node02  running (115m)  12s ago  115m  16.2.4     8d91d370c2b8  51f4885a1b06
> >> mon.node03  node03  stopped         4m ago   19h   <unknown>  <unknown>     <unknown>
> >>
> >> (where node03 is the host which was rebooted).
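
With more than a handful of hosts, the listing above can be scanned mechanically for daemons that did not come back. A minimal sketch in Python, assuming the plain-text 'ceph orch ps' rows keep the daemon name, host and status as the first three columns and separate columns with two or more spaces. The function name stopped_daemons and the SAMPLE text are illustrative only and not part of Ceph.

```python
import re


def stopped_daemons(orch_ps_text):
    """Return (daemon, host) pairs whose status column is not 'running'.

    Assumption: columns are separated by two or more spaces, with the
    daemon name first, the host second and the status third.
    """
    result = []
    for line in orch_ps_text.splitlines():
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip blank or short lines and a header row, if any
        name, host, status = fields[0], fields[1], fields[2]
        if not status.startswith("running"):
            result.append((name, host))
    return result


# Rows copied from the report above (quote prefixes removed).
SAMPLE = """\
mon.node01  node01  running (20h)   4m ago   20h   16.2.4     8d91d370c2b8  0a2e86af94b2
mon.node02  node02  running (115m)  12s ago  115m  16.2.4     8d91d370c2b8  51f4885a1b06
mon.node03  node03  stopped         4m ago   19h   <unknown>  <unknown>     <unknown>
"""

print(stopped_daemons(SAMPLE))
```

For the three rows in this report, only mon.node03 on node03 is flagged.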
> >>
> >> - I tried to start the mon container manually on node03 with '/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
> >> and got the following output:
> >>
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 3314933069573799936,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 432629308056666112,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 432629308056666112,
> >> adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700  0
> >> mon.node03@-1(???).osd e408 crush map has features 432629308056666112,
> >> adjusting msgr requires
> >> cluster 2021-05-23T08:07:12.189243+0000 mgr.node01.ksitls (mgr.14164)
> >> 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB
> >> data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
> >> debug 2021-05-23T08:24:25.196+0000 7f9a9e358700  1
> >> mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> >> debug 2021-05-23T08:24:25.208+0000 7f9a88176700  1 heartbeat_map
> >> reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out
> >> after 0.000000000s
> >> debug 2021-05-23T08:24:25.208+0000 7f9a9e358700  0
> >> mon.node03@-1(probing) e5  my rank is now 1 (was -1)
> >> debug 2021-05-23T08:24:25.212+0000 7f9a87975700  0 mon.node03@1(probing)
> >> e6  removed from monmap, suicide.
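
The line that matters in the log above is the last one: the mon finds that it is no longer in the monmap and exits. A minimal Python sketch for spotting that line in a saved log, assuming the "removed from monmap" wording shown above and that any mail quote prefixes have already been stripped. The regular expression and the name find_monmap_removals are my own, and reading the number after "e" as the monmap epoch is an assumption.

```python
import re

# Matches e.g. "mon.node03@1(probing) e6  removed from monmap, suicide."
REMOVED_RE = re.compile(
    r"mon\.(?P<name>[^@\s]+)@(?P<rank>-?\d+)\((?P<state>[^)]*)\)"
    r"\s+e(?P<epoch>\d+)\s+removed from monmap"
)


def find_monmap_removals(log_text):
    """Return (mon name, rank, epoch) for each 'removed from monmap' line."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m.group("name"), int(m.group("rank")), int(m.group("epoch")))
        for m in REMOVED_RE.finditer(flat)
    ]


# Last lines of the log above, wrapped the same way as in the mail.
SAMPLE_LOG = """\
debug 2021-05-23T08:24:25.208+0000 7f9a9e358700  0
mon.node03@-1(probing) e5  my rank is now 1 (was -1)
debug 2021-05-23T08:24:25.212+0000 7f9a87975700  0 mon.node03@1(probing)
e6  removed from monmap, suicide.
"""

print(find_monmap_removals(SAMPLE_LOG))
```

Lines such as "my rank is now 1 (was -1)" are deliberately ignored; only the removal message is reported.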
> >>
> >> root@node03:/home/adrian# systemctl status
> >> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> >> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph
> >> mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
> >>       Loaded: loaded
> >> (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service;
> >> enabled; vendor preset: enabled)
> >>       Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
> >>      Process: 1176 ExecStart=/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run
> >> (code=exited, status=0/SUCCESS)
> >>      Process: 1855 ExecStop=/usr/bin/docker stop
> >> ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited,
> >> status=1/FAILURE)
> >>      Process: 1861 ExecStopPost=/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop
> >> (code=exited, status=0/SUCCESS)
> >>     Main PID: 1176 (code=exited, status=0/SUCCESS)
> >>
> >> The only fix I could find was to redeploy the mon with:
> >>
> >> ceph orch daemon rm mon.node03 --force
> >> ceph orch daemon add mon node03
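
If the redeploy has to be repeated on several hosts, the two commands above can be wrapped in a small helper. A minimal Python sketch, assuming the daemon is named mon.<host> as in this thread. The helper names redeploy_mon_commands and redeploy_mon are hypothetical, and the default is a dry run that only prints what would be executed.

```python
import shlex
import subprocess


def redeploy_mon_commands(host):
    """Build the two workaround commands from this thread for one host."""
    return [
        ["ceph", "orch", "daemon", "rm", f"mon.{host}", "--force"],
        ["ceph", "orch", "daemon", "add", "mon", host],
    ]


def redeploy_mon(host, dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    commands = redeploy_mon_commands(host)
    for cmd in commands:
        if dry_run:
            print("would run:", shlex.join(cmd))
        else:
            # check=True stops before the 'add' if the 'rm' fails.
            subprocess.run(cmd, check=True)
    return commands


redeploy_mon("node03")
```

Calling redeploy_mon("node03", dry_run=False) would run the commands for real. This only automates the workaround and does not address the underlying cause.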
> >>
> >> However, even though it works after the redeploy, an issue like this
> >> does not give me much confidence to use it in a production environment.
> >> I could reproduce it with two different mons, so it's not a one-off.
> >>
> >> My setup is based on Ubuntu 20.04 and uses Docker instead of Podman:
> >>
> >> root@node01:~# docker -v
> >> Docker version 20.10.6, build 370c289
> >>
> >> Do you know a workaround for this issue, or is this a known bug? I
> >> noticed that there are some other complaints about the same behaviour in
> >> Octopus as well, and the solution at that time was to delete the
> >> /var/lib/ceph/mon folder.
> >>
> >>
> >> Thanks.
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Dipl.-Inf. André Gemünd, Leiter IT / Head of IT
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend@xxxxxxxxxxxxxxxxxx
> Tel: +49 2241 14-4199
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend