Not sure if I mentioned this before: adding a new monitor also puts the
whole cluster into a stuck state.

Some minutes ago I did:

root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
port defaulted to 6789; adding mon.server2 at [2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0

and then started the daemon on server2:

ceph-mon -i server2 --pid-file /var/lib/ceph/run/mon.server2.pid \
  -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph \
  -d 2>&1 | tee ~/cephmonlog-2017-10-08-2

Now the cluster hangs (as in: ceph -s does not return). Looking at
mon_status on server5 shows that server5 thinks it is time for an
election [0].

When I stop the monitor on server2 and then try to remove server2 again,
the removal command also gets stuck and never returns:

root@server1:~# ceph mon rm server2

As our cluster is now severely degraded, I was wondering if anyone has a
quick hint on how to get ceph -s working again and/or remove server2
and/or re-add server1?
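If I understand the quorum rules correctly, the monmap now lists four
monitors, so three are needed for quorum, but only server5 and server3
are in a state to form one; that would also explain why "ceph mon rm"
hangs, since it needs a working quorum itself. Unless someone has a
better idea, my next step would be the offline monmap surgery described
in the docs for removing monitors from an unhealthy cluster, roughly
like this (untested on our side so far; the mon id and the temporary
path are only examples from our setup):

  # with all ceph-mon daemons stopped, on each surviving monitor (e.g. server5):
  ceph-mon -i server5 --extract-monmap /tmp/monmap   # dump the monmap from the local mon store
  monmaptool --print /tmp/monmap                     # verify it still lists server2
  monmaptool /tmp/monmap --rm server2                # drop the half-added monitor
  ceph-mon -i server5 --inject-monmap /tmp/monmap    # write the corrected map back
  # then start the monitors again and check with: ceph daemon mon.server5 quorum_status

Once server5 and server3 have quorum again, I assume server1 and server2
could be re-added the normal way (fresh mon store via ceph-mon --mkfs
with the current monmap and mon keyring, then start the daemon). Does
that sound sane, or is there a less invasive way out?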
Best,

Nico

[0]
[10:50:38] server5:~# ceph daemon mon.server5 mon_status
{
    "name": "server5",
    "rank": 0,
    "state": "electing",
    "election_epoch": 6087,
    "quorum": [],
    "features": {
        "required_con": "153140804152475648",
        "required_mon": [
            "kraken",
            "luminous"
        ],
        "quorum_con": "2305244844532236283",
        "quorum_mon": [
            "kraken",
            "luminous"
        ]
    },
    "outside_quorum": [],
    "extra_probe_peers": [
        "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
    ],
    "sync_provider": [],
    "monmap": {
        "epoch": 11,
        "fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
        "modified": "2017-10-08 10:43:49.667986",
        "created": "2017-05-16 22:33:04.500528",
        "features": {
            "persistent": [
                "kraken",
                "luminous"
            ],
            "optional": []
        },
        "mons": [
            {
                "rank": 0,
                "name": "server5",
                "addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0",
                "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0"
            },
            {
                "rank": 1,
                "name": "server3",
                "addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0",
                "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0"
            },
            {
                "rank": 2,
                "name": "server2",
                "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
                "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
            },
            {
                "rank": 3,
                "name": "server1",
                "addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0",
                "public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0"
            }
        ]
    },
    "feature_map": {
        "mon": {
            "group": {
                "features": "0x1ffddff8eea4fffb",
                "release": "luminous",
                "num": 1
            }
        },
        "client": {
            "group": {
                "features": "0x1ffddff8eea4fffb",
                "release": "luminous",
                "num": 4
            }
        }
    }
}


Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:

> Good evening Joao,
>
> we double checked our MTUs; they are all 9200 on the servers and 9212
> on the switches. And we have no problems transferring big files in
> general (as OpenNebula copies around images for importing, we do this
> quite a lot).
>
> So if you could have a look, it would be much appreciated.
>
> If we should collect other logs, just let us know.
>
> Best,
>
> Nico
>
> Joao Eduardo Luis <joao@xxxxxxx> writes:
>
>> On 10/04/2017 09:19 PM, Gregory Farnum wrote:
>>> Oh, hmm, you're right. I see synchronization starts, but it seems to
>>> progress very slowly, and it certainly doesn't complete in that 2.5
>>> minute logging window. I don't see any clear reason why it's so slow;
>>> it might be more clear if you could provide logs of the other
>>> monitors at the same time (especially since you now say they are
>>> getting stuck in the electing state during that period). Perhaps Kefu
>>> or Joao will have some clearer idea what the problem is.
>>> -Greg
>>
>> I haven't gone through the logs yet (maybe Friday, it's late today and
>> it's a holiday tomorrow), but not so long ago I seem to recall someone
>> having a similar issue with the monitors that was solely related to a
>> switch's MTU being too small.
>>
>> Maybe that could be the case? If not, I'll take a look at the logs as
>> soon as possible.
>>
>> -Joao
>>
>>>
>>> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
>>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>>
>>> Some more detail:
>>>
>>> when restarting the monitor on server1, it stays in the synchronizing
>>> state forever.
>>>
>>> However, the other two monitors change into the electing state.
>>>
>>> I have double checked that there are no (host) firewalls active and
>>> that the times on the hosts are within 1 second of each other (they
>>> all have ntpd running).
>>>
>>> We are running everything on IPv6, but this should not be a problem,
>>> should it?
>>>
>>> Best,
>>>
>>> Nico
>>>
>>>
>>> Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:
>>>
>>> > Hello Gregory,
>>> >
>>> > the logfile I produced already has debug mon = 20 set:
>>> >
>>> > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
>>> > debug mon = 20
>>> >
>>> > It is clear that server1 is out of quorum; however, how do we make
>>> > it part of the quorum again?
>>> >
>>> > I expected that the quorum finding process would be triggered
>>> > automatically after restarting the monitor, or is that incorrect?
>>> >
>>> > Best,
>>> >
>>> > Nico
>>> >
>>> >
>>> > Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>>> >
>>> >> You'll need to change the config so that it's running "debug mon = 20"
>>> >> for the log to be very useful here. It does say that it's dropping
>>> >> client connections because it's been out of quorum for too long,
>>> >> which is the correct behavior in general. I'd imagine that you've
>>> >> got clients trying to connect to the new monitor instead of the
>>> >> ones already in the quorum and not passing around correctly; this
>>> >> is all configurable.
>>> >>
>>> >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius
>>> >> <nico.schottelius@xxxxxxxxxxx> wrote:
>>> >>
>>> >>> Good morning,
>>> >>>
>>> >>> we have recently upgraded our kraken cluster to luminous and since
>>> >>> then noticed an odd behaviour: we cannot add a monitor anymore.
>>> >>>
>>> >>> As soon as we start a new monitor (server2), ceph -s and ceph -w
>>> >>> start to hang.
>>> >>>
>>> >>> The situation became worse after one of our staff stopped an
>>> >>> existing monitor (server1): restarting that monitor results in the
>>> >>> same situation, i.e. ceph -s hangs until we stop the monitor again.
>>> >>>
>>> >>> We kept the monitor running for some minutes, but the situation
>>> >>> never clears up.
>>> >>>
>>> >>> The network does not have any firewall in between the nodes, and
>>> >>> there are no host firewalls.
>>> >>>
>>> >>> I have attached the output of the monitor on server1, running in
>>> >>> the foreground using
>>> >>>
>>> >>> root@server1:~# ceph-mon -i server1 --pid-file
>>> >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>> >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>> >>>
>>> >>> Does anyone see any obvious problem in the attached log?
>>> >>>
>>> >>> Any input or hint would be appreciated!
>>> >>>
>>> >>> Best,
>>> >>>
>>> >>> Nico
>>> >>>
>>> >>> --
>>> >>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
>>>
>>> --
>>> Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com