Re: Cluster became unresponsive: e5 handle_auth_request failed to assign global_id

Here are all the listening ports on mon1 (excluding sshd and ntpd):

# netstat -npl
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 <mon1_ip>:3300          0.0.0.0:*               LISTEN      1582/ceph-mon
tcp        0      0 <mon1_ip>:6789          0.0.0.0:*               LISTEN      1582/ceph-mon
tcp6       0      0 :::9093                 :::*                    LISTEN      908/alertmanager
tcp6       0      0 :::9094                 :::*                    LISTEN      908/alertmanager
tcp6       0      0 :::9095                 :::*                    LISTEN      896/prometheus
tcp6       0      0 :::9100                 :::*                    LISTEN      906/node_exporter
tcp6       0      0 :::3000                 :::*                    LISTEN      882/grafana-server
udp6       0      0 :::9094                 :::*                                908/alertmanager

I've tried telnet from the mon1 host; I can connect to 3300 and 6789:

# telnet <mon1_ip> 3300
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v2

# telnet <mon1_ip> 6789
Trying <mon1_ip>...
Connected to <mon1_ip>.
Escape character is '^]'.
ceph v027QQ

Ports 6800 and 6801 refuse the connection:

# telnet <mon1_ip> 6800
Trying <mon1_ip>...
telnet: Unable to connect to remote host: Connection refused
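
Ports 6800/6801 on this host should presumably be the mgr. If it helps, I can
also check what the mgr process has actually bound from inside its container,
with something like this (assuming ss, or netstat as a fallback, is available
in the ceph/ceph:v15 image):

# docker exec ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl ss -tlnp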

I don't see any bind-failure errors in the logs, and all Ceph systemd
services are running as far as I can tell:

# systemctl list-units -a | grep ceph
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@alertmanager.mon1.service    loaded    active   running   Ceph alertmanager.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@crash.mon1.service           loaded    active   running   Ceph crash.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@grafana.mon1.service         loaded    active   running   Ceph grafana.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service      loaded    active   running   Ceph mgr.mon1.peevkl for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mon.mon1.service             loaded    active   running   Ceph mon.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@node-exporter.mon1.service   loaded    active   running   Ceph node-exporter.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@prometheus.mon1.service      loaded    active   running   Ceph prometheus.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
  system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice  loaded    active   active    system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice
  ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5.target                       loaded    active   active    Ceph cluster e30397f0-cc32-11ea-8c8e-000c29469cd5
  ceph.target                                                            loaded    active   active    All Ceph clusters and services
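
If more detail is useful, I can pull the journal for any of these units, e.g.:

# journalctl -u ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mon.mon1.service --since "1 hour ago"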

Here are the currently running docker containers:

# docker ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS              PORTS               NAMES
dfd8dbeccf1e        ceph/ceph:v15                "/usr/bin/ceph-mgr -…"   41 minutes ago      Up 41 minutes                           ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl
9452d1db7ffb        ceph/ceph:v15                "/usr/bin/ceph-mon -…"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mon.mon1
703ec4a43824        prom/prometheus:v2.18.1      "/bin/prometheus --c…"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-prometheus.mon1
d816ec5e645f        ceph/ceph:v15                "/usr/bin/ceph-crash…"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-crash.mon1
38d283ba6424        ceph/ceph-grafana:latest     "/bin/sh -c 'grafana…"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-grafana.mon1
cc119ec8f09a        prom/node-exporter:v0.18.1   "/bin/node_exporter …"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-node-exporter.mon1
aa1d339c4100        prom/alertmanager:v0.20.0    "/bin/alertmanager -…"   3 hours ago         Up 3 hours                              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-alertmanager.mon1
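
The daemon output can also be tailed straight from the containers if that is
easier to read than the unit journals, e.g.:

# docker logs --tail 50 ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mon.mon1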

iptables is active. I tried setting all chain policies to ACCEPT (it didn't
help); the relevant rules are:

    0     0 CEPH       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:6789
 5060  303K CEPH       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 6800:7300

Chain CEPH includes addresses for monitors and OSDs.
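
Since I switched the host to iptables-legacy and restarted docker, I can also
dump the rules from both backends (Debian 10 ships both wrappers) in case
docker re-created its chains under the other one:

# iptables-legacy -S
# iptables-nft -S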

Mon, 27 Jul 2020 at 17:07, Dino Godor <dg@xxxxxxxxxxxx>:

> Hi,
>
> Have you tried to connect to the ports locally with netcat (or telnet)?
>
> Is the process listening? (something like netstat -4ln or the current
> equivalent thereof)
>
> Is the old (or new) firewall maybe still running?
>
>
> On 27.07.20 16:00, Илья Борисович Волошин wrote:
> > Hello,
> >
> > I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6 hosts
> > in total, all ESXi VMs). It lived through a couple of reboots without
> > problems, then I reconfigured the main host a bit: I set iptables-legacy as
> > the current option in update-alternatives (this is a Debian 10 system),
> > applied a basic iptables ruleset, and restarted docker.
> >
> > After that the cluster became unresponsive (any ceph command hangs
> > indefinitely). I can still use the admin socket to manipulate the config,
> > though. Setting debug_ms to 5, I see this in the logs (timestamps cut for
> > readability):
> >
> > 7f4096f41700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0).send_message enqueueing message m=0x55c21bd84a00 type=67 mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1 mon_release octopus) v7
> > 7f4098744700  1 --  >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process reconnect failed to v2:<mon1_ip>:6800/561959008
> > 7f4098744700  2 --  >> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008] conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1 s=STATE_CONNECTING_RE l=0).process connection refused!
> >
> > and this:
> >
> > 7f4098744700  2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
> > 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
> > 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_recv_state
> > 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0 cs=0 l=1 rx=0 tx=0).reset_security
> > 7f409373a700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).accept
> > 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> > 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8 peer_addr_for_me=v2:<mon1_ip>:3300/0
> > 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>   conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
> > 7f4098744700  1 mon.mon1@0(probing) e5 handle_auth_request failed to assign global_id
> >
> > Config (the result of ceph --admin-daemon
> > /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok config
> > show):
> > https://pastebin.com/kifMXs9H
> >
> > I can connect to ports 3300 and 6789 with telnet; 6800 and 6801 return
> > 'process connection refused'.
> >
> > Setting all iptables policies to ACCEPT didn't change anything.
> >
> > Where should I start digging to fix this problem? I'd like to at least
> > understand why this happened before putting the cluster into production.
> > Any help is appreciated.
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



