Re: Mgr stability

I'm having similar ceph-mgr stability problems since upgrading from
13.2.5 to 13.2.6. I have isolated the crashing to the prometheus
module being enabled, and I see much better stability when the
prometheus module is NOT enabled: no more failovers. However, even
with prometheus disabled, my haproxy keepalive checks for the
ceph-dashboard occasionally fail their Layer-4 checks, though not
often enough to make the dashboard unusable. I saw no such issues on
13.2.5. I have opened a Ceph bug for 13.2.6, which has been
acknowledged as a duplicate of another ticket. Thankfully we also use
Zabbix for monitoring, so until this is fixed I am relying on Zabbix
metrics instead of Prometheus/Grafana.
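
In case it is useful to anyone else hitting this, toggling the module
off (and back on once the fix lands) is just a one-liner against the
active mgr; roughly:

# ceph mgr module disable prometheus
# ceph mgr module ls
# ceph mgr module enable prometheus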

Ceph bug here: https://tracker.ceph.com/issues/41171

# ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "crash",
        "dashboard",
        "restful",
        "status",
        "zabbix"
    ],

On Wed, Aug 14, 2019 at 12:09 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> I'm experiencing some stability issues with my ceph-mgr daemons after upgrading to Nautilus (13.2.6 -> 14.2.2).
> On top of Ubuntu 18.04, the mgrs are colocated with the mons, and 2 of the 3 mon/mgr nodes also colocate an MDS, running a single rank in active:standby-replay.
>
> The main issue I am seeing is that mgrs silently go unresponsive, to the point where ceph -s reports "no active mgr", even though none of the mgr daemon processes actually crash or exit. Typically they end up coming back up/in after some amount of time, but in the meantime metrics stall and mgr commands stall/hang.
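>
> For reference, when this happens the quick checks look roughly like the below, with `ceph mgr fail` as the fallback to force a standby to take over (the daemon name is just a placeholder):
>
> $ ceph -s | grep mgr
> $ ceph mgr dump | grep -E 'active_name|available'
> $ ceph mgr fail <mgr-name>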
>
> $ ceph mgr module ls
> {
>     "enabled_modules": [
>         "diskprediction_local",
>         "influx",
>         "iostat",
>         "prometheus",
>         "rbd_support",
>         "restful",
>         "telemetry"
>     ],
>
>
> I ran the mgr with debug=20 (rough commands below) and was unable to find anything terribly useful there, but I can try to reproduce things if that seems helpful.
> The only real non-standard bit of my mgr instances is that I cherry-picked this change to module.py to cure an issue with the influx plugin.
> Typically these stalls happen when the mgr is servicing a command, and lately that command has been the balancer.
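>
> In case the exact knobs matter, the debug level was bumped with something like this (and `ceph config rm mgr debug_mgr` drops it back to the default afterwards); the balancer can be paused the same way while testing:
>
> $ ceph config set mgr debug_mgr 20
> $ ceph balancer off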
>
> For example:
>
> $ time ceph balancer eval $POOL --debug-ms 1
> 2019-08-14 11:36:21.370 7f4990601700  1  Processor -- start
> 2019-08-14 11:36:21.370 7f4990601700  1 --  start start
> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
> 2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.249:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0
> 2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.250:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am v2:10.0.1.250:44952/0 (socket says 10.0.1.250:44952)
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 learned_addr learned my addr 10.0.1.250:0/2021657657 (peer_addr_for_me v2:10.0.1.250:0/0)
> 2019-08-14 11:36:21.370 7f498ebc8700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 --> v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f498808d1b0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=10837f5a102e309b server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c0020d0 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c002410 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c002b40 con 0x7f4988168a80
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 msgr2=0x7f4988169ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 shutdown_connections
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 shutdown_connections
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 wait complete.
> 2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 s=STATE_NONE l=0).mark_down
> 2019-08-14 11:36:21.374 7f4990601700  1  Processor -- start
> 2019-08-14 11:36:21.374 7f4990601700  1 --  start start
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.249:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030
> 2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.251:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0
> 2019-08-14 11:36:21.374 7f498ebc8700  1 --1-  >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 says I am v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466)
> 2019-08-14 11:36:21.374 7f498ebc8700  1 -- 10.0.1.250:0/40211647 learned_addr learned my addr 10.0.1.250:0/40211647 (peer_addr_for_me v1:10.0.1.250:0/0)
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 935+0+0 (unknown 1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 1022418838 0 0) 0x7f497c0024e0 con 0x7f49880a8360
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c530 con 0x7f496c009b80
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c6e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c890 con 0x7f496c003ed0
> 2019-08-14 11:36:21.374 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.374 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.374 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 msgr2=0x7f496c0049c0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 msgr2=0x7f496c00bfe0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f497c008180 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 -- 0x7f49880b47e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).handle_auth_reply_more state changed!
> 2019-08-14 11:36:21.378 7f498ebc8700  1 -- 10.0.1.250:0/40211647 reap_dead start
> 2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 -- 0x7f49880a8360 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f49880b47e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c0010a0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c001490 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 0x7f497c001690 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
> 2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) v4 ==== 106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10
> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
> 2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready entity=mgr.1731601147 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
> 2019-08-14 11:36:21.382 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10
> 2019-08-14 11:36:21.390 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+101614 (crc 0 0 0) 0x7f497c000e90 con 0x7f496c004f10
> 2019-08-14 11:36:21.586 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: {"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}) v1 -- 0x7f4968002150 con 0x7f496c028790
> 2019-08-14 11:37:40.274 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mgr.1731601147 v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 ==== 8+0+47 (crc 0 0 0) 0x7f4968002150 con 0x7f496c028790
> pool "$POOL" score 0.009701 (lower is better)
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 msgr2=0x7f496c005360 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 shutdown_connections
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 shutdown_connections
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 wait complete.
> 2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 s=STATE_NONE l=0).mark_down
>
> real    1m19.023s
> user    0m0.288s
> sys     0m0.050s
>
>
> While the mgr didn't die, it did take 80 seconds to eval a pool, something that typically takes roughly 2-5 seconds to run.
> And while this command was running, the ceph-mgr process was sitting at around 100% CPU in top.
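>
> For what it's worth, that CPU figure is just from plain top; a per-thread view, to see whether it is a single module thread that is pegged, would be something like:
>
> $ top -H -p $(pidof ceph-mgr)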
>
> 2019-08-14 11:36:20.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:36:20.734 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:36:21.586 7fbd77570700  0 log_channel(audit) log [DBG] : from='client.1731656635 -' entity='client.admin' cmd=[{"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}]: dispatch
> 2019-08-14 11:36:21.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:36:21.734 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:36:22.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> [snip-repeating]
>
> 2019-08-14 11:37:38.826 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:38.826 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:37:39.574 7fbd77d71700  1 mgr[balancer] Handling command: '{'prefix': 'balancer eval', 'option': '$POOL', 'target': ['mgr', '']}'
>
> 2019-08-14 11:37:40.186 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:40.186 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
> 2019-08-14 11:37:41.054 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
> 2019-08-14 11:37:41.054 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
>
>
> I filtered out a bunch of messages from the mgr servicing `osd pool stats` requests, which I have running in a watch -n1.
> I also snipped a bunch of repeating messages about mds.db, the standby-replay MDS; I'm not sure whether those are worth caring about.
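>
> The stats loop in question is nothing more elaborate than something like:
>
> $ watch -n1 ceph osd pool stats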
>
> Hoping someone may have some ideas of where to look to improve the stability of the mgrs.
>
> Thanks,
>
> Reed
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


