Re: Cluster became unresponsive: e5 handle_auth_request failed to assign global_id

No, they are stored locally on ESXi datastores on top of hardware RAID5
built with SAS/SATA drives (the hardware differs between hosts).

Also, I've tried going back to the snapshot taken just after all monitors
and OSDs were added to the cluster. The host boots fine and works as it
should; however, after the next reboot the problem reappears (no
configuration changes were made in between).

And another thing: even though the docker container for the mgr is running
and shows no errors in the logs, either inside the container or on the
parent host, the mgr doesn't bind to any of the ports it should: 6800,
6801, and 8443 for the dashboard. I'm not sure whether that is the cause
or a consequence of this problem.
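
A quick check on the mon1 host (a sketch, just the same netstat as in my
earlier mail, filtered for the mgr ports) shows nothing bound:

# netstat -npl | grep -E ':(6800|6801|8443)'
(no output)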


Tue, 28 Jul 2020 at 11:37, Anthony D'Atri <anthony.datri@xxxxxxxxx>:

> Are your mon DBs on SSDs?
>
> > On Jul 27, 2020, at 7:28 AM, Илья Борисович Волошин <i.voloshin@xxxxxxxxxxxxxxxxxx> wrote:
> >
> > Here are all the active ports on mon1 (with the exception of sshd and
> > ntpd):
> >
> > # netstat -npl
> > Proto Recv-Q Send-Q Local Address           Foreign Address         State    PID/Program name
> > tcp        0      0 <mon1_ip>:3300          0.0.0.0:*               LISTEN   1582/ceph-mon
> > tcp        0      0 <mon1_ip>:6789          0.0.0.0:*               LISTEN   1582/ceph-mon
> > tcp6       0      0 :::9093                 :::*                    LISTEN   908/alertmanager
> > tcp6       0      0 :::9094                 :::*                    LISTEN   908/alertmanager
> > tcp6       0      0 :::9095                 :::*                    LISTEN   896/prometheus
> > tcp6       0      0 :::9100                 :::*                    LISTEN   906/node_exporter
> > tcp6       0      0 :::3000                 :::*                    LISTEN   882/grafana-server
> > udp6       0      0 :::9094                 :::*                             908/alertmanager
> >
> > I've tried telnet from the mon1 host; I can connect to 3300 and 6789:
> >
> > # telnet <mon1_ip> 3300
> > Trying <mon1_ip>...
> > Connected to <mon1_ip>.
> > Escape character is '^]'.
> > ceph v2
> >
> > # telnet <mon1_ip> 6789
> > Trying <mon1_ip>...
> > Connected to <mon1_ip>.
> > Escape character is '^]'.
> > ceph v027QQ
> >
> > 6800 and 6801 refuse connection:
> >
> > # telnet <mon1_ip> 6800
> > Trying <mon1_ip>...
> > telnet: Unable to connect to remote host: Connection refused
> >
> > I don't see any errors in the log related to failures to bind... and all
> > Ceph systemd services are running as far as I can tell:
> >
> > # systemctl list-units -a | grep ceph
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@alertmanager.mon1.service    loaded  active  running  Ceph alertmanager.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@crash.mon1.service           loaded  active  running  Ceph crash.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@grafana.mon1.service         loaded  active  running  Ceph grafana.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service      loaded  active  running  Ceph mgr.mon1.peevkl for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mon.mon1.service             loaded  active  running  Ceph mon.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@node-exporter.mon1.service   loaded  active  running  Ceph node-exporter.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@prometheus.mon1.service      loaded  active  running  Ceph prometheus.mon1 for e30397f0-cc32-11ea-8c8e-000c29469cd5
> > system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice  loaded  active  active   system-ceph\x2de30397f0\x2dcc32\x2d11ea\x2d8c8e\x2d000c29469cd5.slice
> > ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5.target                       loaded  active  active   Ceph cluster e30397f0-cc32-11ea-8c8e-000c29469cd5
> > ceph.target                                                            loaded  active  active   All Ceph clusters and services
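> >
> > The mgr unit's journal is clean as well; roughly what I checked (unit
> > name taken from the list above):
> >
> > # journalctl -u ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5@mgr.mon1.peevkl.service --since=-1h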
> >
> > Here are the currently active docker containers:
> >
> > # docker ps
> > CONTAINER ID   IMAGE                        COMMAND                  CREATED          STATUS          PORTS   NAMES
> > dfd8dbeccf1e   ceph/ceph:v15                "/usr/bin/ceph-mgr -…"   41 minutes ago   Up 41 minutes           ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl
> > 9452d1db7ffb   ceph/ceph:v15                "/usr/bin/ceph-mon -…"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mon.mon1
> > 703ec4a43824   prom/prometheus:v2.18.1      "/bin/prometheus --c…"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-prometheus.mon1
> > d816ec5e645f   ceph/ceph:v15                "/usr/bin/ceph-crash…"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-crash.mon1
> > 38d283ba6424   ceph/ceph-grafana:latest     "/bin/sh -c 'grafana…"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-grafana.mon1
> > cc119ec8f09a   prom/node-exporter:v0.18.1   "/bin/node_exporter …"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-node-exporter.mon1
> > aa1d339c4100   prom/alertmanager:v0.20.0    "/bin/alertmanager -…"   3 hours ago      Up 3 hours              ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-alertmanager.mon1
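> >
> > The mgr container's log shows no errors either; this is roughly how I
> > check it (container name from the docker ps output above):
> >
> > # docker logs --tail 100 ceph-e30397f0-cc32-11ea-8c8e-000c29469cd5-mgr.mon1.peevkl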
> >
> > iptables is active; I tried setting all chain policies to ACCEPT
> > (didn't help). The relevant rules are:
> >
> >    0     0 CEPH       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            tcp dpt:6789
> > 5060  303K CEPH       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            multiport dports 6800:7300
> >
> > The CEPH chain contains the addresses of the monitors and OSDs.
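> >
> > For completeness, the chain itself can be dumped with, e.g.:
> >
> > # iptables -L CEPH -n -v --line-numbers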
> >
> > Mon, 27 Jul 2020 at 17:07, Dino Godor <dg@xxxxxxxxxxxx>:
> >
> >> Hi,
> >>
> >> have you tried to connect locally to the ports with netcat (or telnet)?
> >>
> >> Is the process listening? (something like netstat -4ln or the current
> >> equivalent thereof)
> >>
> >> Is the old (or new) firewall maybe still running?
> >>
> >>
> >> On 27.07.20 16:00, Илья Борисович Волошин wrote:
> >>> Hello,
> >>>
> >>> I've created an Octopus 15.2.4 cluster with 3 monitors and 3 OSDs (6
> >>> hosts in total, all ESXi VMs). It lived through a couple of reboots
> >>> without problems, then I reconfigured the main host a bit: I set
> >>> iptables-legacy as the current option in update-alternatives (this is
> >>> a Debian 10 system), applied a basic iptables ruleset, and restarted
> >>> docker.
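> >>>
> >>> The switch was done roughly like this (a sketch; the -legacy paths
> >>> are the standard ones on Debian 10):
> >>>
> >>> # update-alternatives --set iptables /usr/sbin/iptables-legacy
> >>> # update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
> >>> # systemctl restart docker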
> >>>
> >>> After that, the cluster became unresponsive (any ceph command hangs
> >>> indefinitely). I can still use the admin socket to manipulate the
> >>> config, though.
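> >>>
> >>> Raising debug_ms to 5 through the mon's admin socket (the same socket
> >>> path as in the config dump below):
> >>>
> >>> # ceph --admin-daemon \
> >>>     /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok \
> >>>     config set debug_ms 5
> >>>
> >>> With debug_ms at 5, I see this in the logs (timestamps cut for
> >>> readability):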
> >>>
> >>> 7f4096f41700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> [v2:<mon2_ip>:3300/0,v1:<mon2_ip>:6789/0] conn(0x55c21b975800
> >>> 0x55c21ab45180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rx=0 tx=0)
> >>> .send_message enqueueing message m=0x55c21bd84a00 type=67
> >>> mon_probe(probe e30397f0-cc32-11ea-8c8e-000c29469cd5 name mon1
> >>> mon_release octopus) v7
> >>> 7f4098744700  1 --  >>
> >>> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008]
> >>> conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1
> >>> s=STATE_CONNECTING_RE l=0).process reconnect failed to
> >>> v2:<mon1_ip>:6800/561959008
> >>> 7f4098744700  2 --  >>
> >>> [v2:<mon1_ip>:6800/561959008,v1:<mon1_ip>:6801/561959008]
> >>> conn(0x55c21b974400 msgr2=0x55c21ab45600 unknown :-1
> >>> s=STATE_CONNECTING_RE l=0).process connection refused!
> >>>
> >>> and this:
> >>>
> >>> 7f4098744700  2 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0
> >>> cs=0 l=1 rx=0 tx=0)._fault on lossy channel, failing
> >>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0
> >>> cs=0 l=1 rx=0 tx=0).stop
> >>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0
> >>> cs=0 l=1 rx=0 tx=0).reset_recv_state
> >>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21ba38c00 0x55c21bcc5a80 secure :-1 s=AUTH_ACCEPTING pgs=0
> >>> cs=0 l=1 rx=0 tx=0).reset_security
> >>> 7f409373a700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=NONE pgs=0 cs=0 l=0
> >>> rx=0 tx=0).accept
> >>> 7f4098744700  1 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=BANNER_ACCEPTING
> >>> pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0
> >>> required=0
> >>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING
> >>> pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello received hello: peer_type=8
> >>> peer_addr_for_me=v2:<mon1_ip>:3300/0
> >>> 7f4098744700  5 --2- [v2:<mon1_ip>:3300/0,v1:<mon1_ip>:6789/0] >>
> >>> conn(0x55c21c0d2800 0x55c21bcc3f80 unknown :-1 s=HELLO_ACCEPTING
> >>> pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello getsockname says I am
> >>> <mon1_ip>:3300 when talking to v2:<mon1_ip>:49012/0
> >>> 7f4098744700  1 mon.mon1@0(probing) e5 handle_auth_request failed to
> >>> assign global_id
> >>>
> >>> Config (the result of ceph --admin-daemon
> >>> /run/ceph/e30397f0-cc32-11ea-8c8e-000c29469cd5/ceph-mon.mon1.asok
> >>> config show):
> >>> https://pastebin.com/kifMXs9H
> >>>
> >>> I can connect to ports 3300 and 6789 with telnet; 6800 and 6801 return
> >>> 'process connection refused'
> >>>
> >>> Setting all iptables policies to ACCEPT didn't change anything.
> >>>
> >>> Where should I start digging to fix this problem? I'd like to at least
> >>> understand why this happened before putting the cluster into
> >>> production. Any help is appreciated.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx