After spending some hours debugging packets on the wire, without seeing a good reason for things not to work, the monitor on server2 eventually joined the quorum. We were happy for a while, until our alerting sent a message that the quorum had been lost again. And indeed, the monitor on server2 had died, and now comes the not-so-funny part: restarting the monitor makes the cluster hang again.

I will post another debug log in the next hours, this time from the monitor on server2.

Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:

> Not sure if I mentioned it before: adding a new monitor also puts the whole
> cluster into a stuck state.
>
> Some minutes ago I did:
>
> root@server1:~# ceph mon add server2 2a0a:e5c0::92e2:baff:fe4e:6614
> port defaulted to 6789; adding mon.server2 at [2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0
>
> And then started the daemon on server2:
>
> ceph-mon -i server2 --pid-file /var/lib/ceph/run/mon.server2.pid -c /etc/ceph/ceph.conf --cluster ceph --setuser ceph --setgroup ceph -d 2>&1 | tee ~/cephmonlog-2017-10-08-2
>
> And now the cluster hangs (as in: ceph -s does not return).
>
> Looking at the mon_status of server5 shows that server5 thinks it is time
> for an election [0].
>
> When stopping the monitor on server2 and trying to remove server2 again,
> the removal command also gets stuck and never returns:
>
> root@server1:~# ceph mon rm server2
>
> As our cluster is now severely degraded, I was wondering if anyone has a
> quick hint on how to get ceph -s working again and/or remove server2
> and/or re-add server1?
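A note for anyone hitting the same symptom: even when `ceph -s` hangs, each monitor still answers locally through its admin socket, which is how the mon_status output below was gathered. A minimal sketch for checking every monitor's election state (host names are the ones from this thread; the loop only prints the per-host commands so they can be copied to each box, and `jq` is assumed to be available there):

```shell
# Print, for each monitor host, the admin-socket command that reports
# its current state ("leader", "peon", "electing", "synchronizing", ...).
for m in server5 server3 server2 server1; do
  echo "ssh ${m} ceph daemon mon.${m} mon_status | jq -r .state"
done
```

Running these on the hosts themselves shows which monitors are stuck electing versus synchronizing without needing a working quorum.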
>
> Best,
>
> Nico
>
>
> [0]
>
> [10:50:38] server5:~# ceph daemon mon.server5 mon_status
> {
>     "name": "server5",
>     "rank": 0,
>     "state": "electing",
>     "election_epoch": 6087,
>     "quorum": [],
>     "features": {
>         "required_con": "153140804152475648",
>         "required_mon": [
>             "kraken",
>             "luminous"
>         ],
>         "quorum_con": "2305244844532236283",
>         "quorum_mon": [
>             "kraken",
>             "luminous"
>         ]
>     },
>     "outside_quorum": [],
>     "extra_probe_peers": [
>         "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
>     ],
>     "sync_provider": [],
>     "monmap": {
>         "epoch": 11,
>         "fsid": "26c0c5a8-d7ce-49ac-b5a7-bfd9d0ba81ab",
>         "modified": "2017-10-08 10:43:49.667986",
>         "created": "2017-05-16 22:33:04.500528",
>         "features": {
>             "persistent": [
>                 "kraken",
>                 "luminous"
>             ],
>             "optional": []
>         },
>         "mons": [
>             {
>                 "rank": 0,
>                 "name": "server5",
>                 "addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0",
>                 "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a3a2]:6789/0"
>             },
>             {
>                 "rank": 1,
>                 "name": "server3",
>                 "addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0",
>                 "public_addr": "[2a0a:e5c0::21b:21ff:fe85:a42a]:6789/0"
>             },
>             {
>                 "rank": 2,
>                 "name": "server2",
>                 "addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0",
>                 "public_addr": "[2a0a:e5c0::92e2:baff:fe4e:6614]:6789/0"
>             },
>             {
>                 "rank": 3,
>                 "name": "server1",
>                 "addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0",
>                 "public_addr": "[2a0a:e5c0::92e2:baff:fe8a:2e78]:6789/0"
>             }
>         ]
>     },
>     "feature_map": {
>         "mon": {
>             "group": {
>                 "features": "0x1ffddff8eea4fffb",
>                 "release": "luminous",
>                 "num": 1
>             }
>         },
>         "client": {
>             "group": {
>                 "features": "0x1ffddff8eea4fffb",
>                 "release": "luminous",
>                 "num": 4
>             }
>         }
>     }
> }
>
>
> Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:
>
>> Good evening Joao,
>>
>> we double-checked our MTUs; they are all 9200 on the servers and 9212 on
>> the switches. And we have no problems transferring big files in general
>> (as OpenNebula copies around images for importing, we do this quite a
>> lot).
>>
>> So if you could have a look, it would be much appreciated.
>>
>> If we should collect other logs, just let us know.
>>
>> Best,
>>
>> Nico
>>
>> Joao Eduardo Luis <joao@xxxxxxx> writes:
>>
>>> On 10/04/2017 09:19 PM, Gregory Farnum wrote:
>>>> Oh, hmm, you're right. I see synchronization start, but it seems to
>>>> progress very slowly, and it certainly doesn't complete in that 2.5-minute
>>>> logging window. I don't see any clear reason why it's so slow; it might
>>>> become clearer if you could provide logs from the other monitors covering
>>>> the same period (especially since you now say they are getting stuck in
>>>> the electing state during that period). Perhaps Kefu or Joao will have
>>>> a clearer idea what the problem is.
>>>> -Greg
>>>
>>> I haven't gone through the logs yet (maybe Friday; it's late today and
>>> tomorrow is a holiday), but not so long ago I seem to recall someone
>>> having a similar issue with the monitors that was solely related to a
>>> switch's MTU being too small.
>>>
>>> Maybe that could be the case here? If not, I'll take a look at the logs
>>> as soon as possible.
>>>
>>> -Joao
>>>
>>>>
>>>> On Wed, Oct 4, 2017 at 1:04 PM Nico Schottelius
>>>> <nico.schottelius@xxxxxxxxxxx> wrote:
>>>>
>>>> Some more detail:
>>>>
>>>> when restarting the monitor on server1, it stays in the synchronizing
>>>> state forever.
>>>>
>>>> However, the other two monitors change into the electing state.
>>>>
>>>> I have double-checked that there are no (host) firewalls active and
>>>> that the clocks of the hosts differ by less than one second (they all
>>>> have ntpd running).
>>>>
>>>> We are running everything on IPv6, but this should not be a problem,
>>>> should it?
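Regarding the MTU theory: one way to verify the path MTU between the monitor hosts is an ICMPv6 echo with fragmentation disallowed, sized to fill the frame. With an interface MTU of 9200, the fixed IPv6 header (40 bytes) and the ICMPv6 echo header (8 bytes) leave a 9152-byte payload. A sketch (the target address is server2's from this thread; the `ping` invocation is printed rather than executed, and the `-6 -M do` flags assume a recent iputils ping; older systems spell it `ping6 -M do`):

```shell
# Largest unfragmented ICMPv6 echo payload for a 9200-byte interface MTU.
MTU=9200          # interface MTU reported on the servers
IPV6_HDR=40       # fixed IPv6 header size
ICMP6_HDR=8       # ICMPv6 echo header size
PAYLOAD=$((MTU - IPV6_HDR - ICMP6_HDR))
echo "payload: ${PAYLOAD} bytes"

# Run this on one monitor host; -M do forbids fragmentation, so replies
# only come back if the whole path really carries 9200-byte frames.
echo "ping -6 -c 3 -M do -s ${PAYLOAD} 2a0a:e5c0::92e2:baff:fe4e:6614"
```

If the full-size probe is dropped while a small one succeeds, a too-small MTU somewhere on the path (e.g. on a switch port) is the likely culprit.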
>>>>
>>>> Best,
>>>>
>>>> Nico
>>>>
>>>>
>>>> Nico Schottelius <nico.schottelius@xxxxxxxxxxx> writes:
>>>>
>>>> > Hello Gregory,
>>>> >
>>>> > the logfile I produced already has debug mon = 20 set:
>>>> >
>>>> > [21:03:51] server1:~# grep "debug mon" /etc/ceph/ceph.conf
>>>> > debug mon = 20
>>>> >
>>>> > It is clear that server1 is out of quorum, but how do we make it
>>>> > part of the quorum again?
>>>> >
>>>> > I expected the quorum-finding process to be triggered automatically
>>>> > after restarting the monitor; is that incorrect?
>>>> >
>>>> > Best,
>>>> >
>>>> > Nico
>>>> >
>>>> >
>>>> > Gregory Farnum <gfarnum@xxxxxxxxxx> writes:
>>>> >
>>>> >> You'll need to change the config so that it's running "debug mon = 20"
>>>> >> for the log to be very useful here. It does say that it's dropping
>>>> >> client connections because it's been out of quorum for too long, which
>>>> >> is the correct behavior in general. I'd imagine that you've got clients
>>>> >> trying to connect to the new monitor instead of the ones already in the
>>>> >> quorum and not failing over correctly; this is all configurable.
>>>> >>
>>>> >> On Wed, Oct 4, 2017 at 4:09 AM Nico Schottelius <
>>>> >> nico.schottelius@xxxxxxxxxxx> wrote:
>>>> >>
>>>> >>>
>>>> >>> Good morning,
>>>> >>>
>>>> >>> we recently upgraded our kraken cluster to luminous and have since
>>>> >>> noticed an odd behaviour: we cannot add a monitor anymore.
>>>> >>>
>>>> >>> As soon as we start a new monitor (server2), ceph -s and ceph -w
>>>> >>> start to hang.
>>>> >>>
>>>> >>> The situation became worse after one of our staff stopped an existing
>>>> >>> monitor (server1), as restarting that monitor results in the same
>>>> >>> situation: ceph -s hangs until we stop the monitor again.
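In case it helps anyone debugging a similar election problem: the debug level can also be raised at runtime through the monitor's admin socket, which works even while the monitor is out of quorum and so avoids another restart. A sketch (the commands are printed rather than executed; `debug_paxos` alongside `debug_mon` is my suggestion for election issues, not something from this thread):

```shell
# Raise monitor debug output at runtime via the local admin socket.
# "20/20" means log level 20 and in-memory gather level 20.
echo "ceph daemon mon.server1 config set debug_mon 20/20"
echo "ceph daemon mon.server1 config set debug_paxos 20/20"
```

Values set this way are not persisted; they revert on the next daemon restart unless also placed in ceph.conf.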
>>>> >>>
>>>> >>> We kept the monitor running for some minutes, but the situation
>>>> >>> never clears up.
>>>> >>>
>>>> >>> The network does not have any firewall between the nodes, and there
>>>> >>> are no host firewalls.
>>>> >>>
>>>> >>> I have attached the output of the monitor on server1, running in the
>>>> >>> foreground using
>>>> >>>
>>>> >>> root@server1:~# ceph-mon -i server1 --pid-file
>>>> >>> /var/lib/ceph/run/mon.server1.pid -c /etc/ceph/ceph.conf --cluster ceph
>>>> >>> --setuser ceph --setgroup ceph -d 2>&1 | tee cephmonlog
>>>> >>>
>>>> >>> Does anyone see any obvious problem in the attached log?
>>>> >>>
>>>> >>> Any input or hint would be appreciated!
>>>> >>>
>>>> >>> Best,
>>>> >>>
>>>> >>> Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com