https://github.com/ceph/ceph/pull/42690 looks like it might be a fix,
but it's pending review.

On Thu, Aug 12, 2021 at 7:46 AM André Gemünd
<andre.gemuend@xxxxxxxxxxxxxxxxxx> wrote:
>
> We're seeing the same here with v16.2.5 on CentOS 8.3
>
> Do you know of any progress?
>
> Best Greetings
> André
>
> ----- On 9 Aug 2021 at 18:15, David Orman ormandj@xxxxxxxxxxxx wrote:
>
> > Hi,
> >
> > We are seeing very similar behavior on 16.2.5, and also have noticed
> > that an undeploy/deploy cycle fixes things. Before we go rummaging
> > through the source code trying to determine the root cause, has
> > anybody else figured this out? It seems odd that a repeatable issue
> > (I've seen other mailing list posts about this same issue) impacting
> > 16.2.4/16.2.5, at least, on reboots hasn't been addressed yet, so
> > wanted to check.
> >
> > Here's one of the other thread titles that appears related:
> > " mons assigned via orch label 'committing suicide' upon
> > reboot."
> >
> > Respectfully,
> > David
> >
> >
> > On Sun, May 23, 2021 at 3:40 AM Adrian Nicolae
> > <adrian.nicolae@xxxxxxxxxx> wrote:
> >>
> >> Hi guys,
> >>
> >> I'm testing Ceph Pacific 16.2.4 in my lab before deciding if I will put
> >> it in production on a 1PB+ storage cluster with rgw-only access.
> >>
> >> I noticed a weird issue with my mons :
> >>
> >> - if I reboot a mon host, the ceph-mon container is not starting after
> >> reboot
> >>
> >> - I can see with 'ceph orch ps' the following output :
> >>
> >> mon.node01 node01 running (20h) 4m ago 20h 16.2.4 8d91d370c2b8 0a2e86af94b2
> >> mon.node02 node02 running (115m) 12s ago 115m 16.2.4 8d91d370c2b8 51f4885a1b06
> >> mon.node03 node03 stopped 4m ago 19h <unknown> <unknown> <unknown>
> >>
> >> (where node03 is the host which was rebooted).
> >>
> >> - I tried to start the mon container manually on node03 with '/bin/bash
> >> /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run'
> >> and I've got the following output :
> >>
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 3314933069573799936, adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
> >> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
> >> cluster 2021-05-23T08:07:12.189243+0000 mgr.node01.ksitls (mgr.14164) 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
> >> debug 2021-05-23T08:24:25.196+0000 7f9a9e358700 1 mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
> >> debug 2021-05-23T08:24:25.208+0000 7f9a88176700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.000000000s
> >> debug 2021-05-23T08:24:25.208+0000 7f9a9e358700 0 mon.node03@-1(probing) e5 my rank is now 1 (was -1)
> >> debug 2021-05-23T08:24:25.212+0000 7f9a87975700 0 mon.node03@1(probing) e6 removed from monmap, suicide.
> >>
> >> root@node03:/home/adrian# systemctl status ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
> >> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
> >>    Loaded: loaded (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; enabled; vendor preset: enabled)
> >>    Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
> >>   Process: 1176 ExecStart=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run (code=exited, status=0/SUCCESS)
> >>   Process: 1855 ExecStop=/usr/bin/docker stop ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, status=1/FAILURE)
> >>   Process: 1861 ExecStopPost=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop (code=exited, status=0/SUCCESS)
> >>  Main PID: 1176 (code=exited, status=0/SUCCESS)
> >>
> >> The only fix I could find was to redeploy the mon with:
> >>
> >> ceph orch daemon rm mon.node03 --force
> >> ceph orch daemon add mon node03
> >>
> >> However, even though it works after a redeploy, an issue like that
> >> does not give me much confidence to use it in a production
> >> environment. I could reproduce it with 2 different mons, so it's not
> >> a one-off.
> >>
> >> My setup is based on Ubuntu 20.04 and docker instead of podman:
> >>
> >> root@node01:~# docker -v
> >> Docker version 20.10.6, build 370c289
> >>
> >> Do you know a workaround for this issue, or is this a known bug? I
> >> noticed that there are some other complaints about the same behaviour
> >> in Octopus as well, and the solution at that time was to delete the
> >> /var/lib/ceph/mon folder.
> >>
> >>
> >> Thanks.
> >>
> >> _______________________________________________
> >> ceph-users mailing list -- ceph-users@xxxxxxx
> >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
> Dipl.-Inf. André Gemünd, Leiter IT / Head of IT
> Fraunhofer-Institute for Algorithms and Scientific Computing
> andre.gemuend@xxxxxxxxxxxxxxxxxx
> Tel: +49 2241 14-4199
> /C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
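
For anyone applying the redeploy workaround quoted above to several hosts, here is a minimal sketch that reads plain 'ceph orch ps' output on stdin and prints the two redeploy commands from this thread for each mon whose status is "stopped". It only prints the commands and does not run them. It assumes the column order shown in the quoted output (name, host, status, ...), with each daemon on a single line; print_mon_redeploy_cmds is a hypothetical helper name, not part of ceph.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not execute) the redeploy commands from this thread
# for every mon that 'ceph orch ps' reports as stopped.
# Assumption: columns are NAME HOST STATUS ..., one daemon per line,
# as in the output quoted in the original report.
# print_mon_redeploy_cmds is a hypothetical helper name.
print_mon_redeploy_cmds() {
    awk '$1 ~ /^mon\./ && $3 == "stopped" {
        # $1 is the daemon name (e.g. mon.node03), $2 is the host (e.g. node03)
        printf "ceph orch daemon rm %s --force\n", $1
        printf "ceph orch daemon add mon %s\n", $2
    }'
}
```

Usage would be something like 'ceph orch ps | print_mon_redeploy_cmds', reviewing the printed commands before running any of them, since removing a mon affects quorum.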