Hello,

I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6 hosts in total, all ESXi VMs). It survived a couple of reboots without problems. Then I reconfigured the main host a bit: selected iptables-legacy as the current option in update-alternatives (this is a Debian 10 system), applied a basic iptables ruleset and restarted docker.

After that the cluster became unresponsive: any ceph command hangs indefinitely. I can still use the admin socket to manipulate the config, though. With debug_ms set to 5 I see this in the monitor log (timestamps cut for readability):

7f4096f41700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0).send_message enqueueing message m=0x55c21bd84a00 type=67 mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1 mon_release octopus) v7
7f4098744700 1 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:81.200.2.152:6800/561959008
7f4098744700 2 -- >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process connection refused!

and this:

7f4098744700 2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
7f4098744700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_recv_state
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_security
7f409373a700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
7f4098744700 1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8 peer_addr_for_me=v2:<mon1_ip>:3300/0
7f4098744700 5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
7f4098744700 1 mon.mon1@0(probing) e5 handle_auth_request failed to assign global_id

Config (the result of ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config show): https://pastebin.com/kifMXs9H

I can connect to ports 3300 and 6789 with telnet; connections to 6800 and 6801 are refused ('connection refused'). Setting all iptables policies to ACCEPT didn't change anything.

Where should I start digging to fix this problem? I'd like to at least understand why this happened before putting the cluster into production. Any help is appreciated.
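For reference, a rough reconstruction of the commands involved (the actual ruleset I applied is omitted, and the exact invocations are from memory, so treat them as approximate):

    # switch the iptables backend to legacy (Debian 10 defaults to the nftables backend)
    update-alternatives --set iptables /usr/sbin/iptables-legacy
    # ... basic INPUT ruleset applied here (not shown) ...
    systemctl restart docker

    # the ceph CLI hangs, so raise messenger debugging via the monitor admin socket instead
    ceph --admin-daemon /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config set debug_ms 5

    # connectivity checks against mon1
    telnet <mon1_ip> 3300    # connects
    telnet <mon1_ip> 6789    # connects
    telnet <mon1_ip> 6800    # connection refused
    telnet <mon1_ip> 6801    # connection refused

    # opening the firewall completely did not help
    iptables -P INPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -P OUTPUT ACCEPT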