So the orchestrator is aware that the mon is stopped, but has not tried to bring it up again. What is the mon placement shown in “ceph orch ls”? I explicitly set it to all host names (e.g. node01;node02;node03) and haven’t experienced this issue (a rough sketch of the command I use is at the end of this mail).

> On May 24, 2021, at 00:35, Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> wrote:
> 
> Hi,
> 
> I waited for more than a day after the first mon failure, and it didn't resolve automatically.
> 
> I checked with 'ceph status' and also the ceph.conf on that host, and the failed mon was removed from the monmap. The cluster reported only 2 mons (instead of 3) and the third mon was completely removed from the config; it wasn't reported as a failure in 'ceph status'.
> 
> 
>> On 5/23/2021 7:30 PM, 胡 玮文 wrote:
>> Hi Adrian,
>> 
>> I have not tried it, but I think it will resolve itself automatically after some minutes. How long did you wait before you did the manual redeploy?
>> 
>> Could you also try “ceph mon dump” to see whether mon.node03 was actually removed from the monmap when it failed to start?
>> 
>>> On May 23, 2021, at 16:40, Adrian Nicolae <adrian.nicolae@xxxxxxxxxx> wrote:
>>> 
>>> Hi guys,
>>> 
>>> I'm testing Ceph Pacific 16.2.4 in my lab before deciding whether to put it in production on a 1PB+ storage cluster with rgw-only access.
>>> 
>>> I noticed a weird issue with my mons:
>>> 
>>> - if I reboot a mon host, the ceph-mon container does not start after the reboot
>>> 
>>> - 'ceph orch ps' shows the following output:
>>> 
>>> mon.node01  node01  running (20h)   4m ago   20h   16.2.4     8d91d370c2b8  0a2e86af94b2
>>> mon.node02  node02  running (115m)  12s ago  115m  16.2.4     8d91d370c2b8  51f4885a1b06
>>> mon.node03  node03  stopped         4m ago   19h   <unknown>  <unknown>     <unknown>
>>> 
>>> (where node03 is the host which was rebooted)
>>> 
>>> - I tried to start the mon container manually on node03 with '/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run' and got the following output:
>>> 
>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 3314933069573799936, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>> debug 2021-05-23T08:24:25.192+0000 7f9a9e358700 0 mon.node03@-1(???).osd e408 crush map has features 432629308056666112, adjusting msgr requires
>>> cluster 2021-05-23T08:07:12.189243+0000 mgr.node01.ksitls (mgr.14164) 36380 : cluster [DBG] pgmap v36392: 417 pgs: 417 active+clean; 33 KiB data, 605 MiB used, 651 GiB / 652 GiB avail; 9.6 KiB/s rd, 0 B/s wr, 15 op/s
>>> debug 2021-05-23T08:24:25.196+0000 7f9a9e358700 1 mon.node03@-1(???).paxosservice(auth 1..51) refresh upgraded, format 0 -> 3
>>> debug 2021-05-23T08:24:25.208+0000 7f9a88176700 1 heartbeat_map reset_timeout 'Monitor::cpu_tp thread 0x7f9a88176700' had timed out after 0.000000000s
>>> debug 2021-05-23T08:24:25.208+0000 7f9a9e358700 0 mon.node03@-1(probing) e5 my rank is now 1 (was -1)
>>> debug 2021-05-23T08:24:25.212+0000 7f9a87975700 0 mon.node03@1(probing) e6 removed from monmap, suicide.
>>> 
>>> root@node03:/home/adrian# systemctl status ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service
>>> ● ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@mon.node03.service - Ceph mon.node03 for c2d41ac4-baf5-11eb-865d-2dc838a337a3
>>>      Loaded: loaded (/etc/systemd/system/ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3@.service; enabled; vendor preset: enabled)
>>>      Active: inactive (dead) since Sun 2021-05-23 08:10:00 UTC; 16min ago
>>>     Process: 1176 ExecStart=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.run (code=exited, status=0/SUCCESS)
>>>     Process: 1855 ExecStop=/usr/bin/docker stop ceph-c2d41ac4-baf5-11eb-865d-2dc838a337a3-mon.node03 (code=exited, status=1/FAILURE)
>>>     Process: 1861 ExecStopPost=/bin/bash /var/lib/ceph/c2d41ac4-baf5-11eb-865d-2dc838a337a3/mon.node03/unit.poststop (code=exited, status=0/SUCCESS)
>>>    Main PID: 1176 (code=exited, status=0/SUCCESS)
>>> 
>>> The only fix I could find was to redeploy the mon with:
>>> 
>>> ceph orch daemon rm mon.node03 --force
>>> ceph orch daemon add mon node03
>>> 
>>> However, even though it works after the redeploy, an issue like that doesn't give me much confidence to run it in a production environment. I could reproduce it with 2 different mons, so it's not just a one-off.
>>> 
>>> My setup is based on Ubuntu 20.04 and docker instead of podman:
>>> 
>>> root@node01:~# docker -v
>>> Docker version 20.10.6, build 370c289
>>> 
>>> Do you know a workaround for this issue, or is this a known bug? I noticed that there are some other complaints about the same behaviour in Octopus as well, and the solution at that time was to delete the /var/lib/ceph/mon folder.
>>> 
>>> 
>>> Thanks.
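P.S. Regarding the placement question above, this is roughly how I pin the mons to an explicit set of hosts with cephadm (a sketch only; the host names are placeholders for your own):

ceph orch apply mon --placement="node01,node02,node03"
ceph orch ls mon

The first command tells the orchestrator to keep one mon on each listed host; the second should then show that host list in the PLACEMENT column, and cephadm is expected to redeploy a mon on any listed host where one is missing.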