Mgr stability

Hi all,

I'm experiencing some stability issues with my ceph-mgr daemons after upgrading to Nautilus (13.2.6 -> 14.2.2).
The cluster runs on Ubuntu 18.04; the mgrs are colocated with the mons, and 2 of the 3 mon/mgr nodes also host an MDS, running a single rank as active:standby-replay.
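
For context, a single-rank active:standby-replay setup in Nautilus corresponds to the per-filesystem setting (the fs name here is a placeholder):

$ ceph fs set cephfs allow_standby_replay true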

The main issue I am seeing is that mgrs will silently stop responding, to the point where ceph -s reports "no active mgr", even though none of the mgr daemons have actually crashed or exited. Typically they end up coming back up/in after some amount of time, but in the meantime metrics stall and mgr commands hang.
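
A manual failover like the one below (daemon name is a placeholder) does get a standby to take over, but I'd rather understand the root cause than keep kicking daemons:

$ ceph mgr fail mon1
$ systemctl restart ceph-mgr@mon1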

$ ceph mgr module ls
{
    "enabled_modules": [
        "diskprediction_local",
        "influx",
        "iostat",
        "prometheus",
        "rbd_support",
        "restful",
        "telemetry"
    ],
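
If one of these modules turns out to be the culprit, it can be toggled at runtime to bisect, e.g.:

$ ceph mgr module disable influx
$ ceph mgr module enable influx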

I ran the mgr with debug 20 and was unable to find anything terribly useful there, but I can try to reproduce the problem if that seems helpful.
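
For anyone wanting to do the same, the mgr debug level can be raised and reverted without a restart via the centralized config:

$ ceph config set mgr debug_mgr 20
$ ceph config rm mgr debug_mgr
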
The only real non-standard bit of my mgr instances is that I cherry-picked this change to module.py to cure an issue with the influx plugin.
Typically these stalls happen while the mgr is servicing a command, and lately that command has been a balancer run.

For example:
$ time ceph balancer eval $POOL --debug-ms 1
2019-08-14 11:36:21.370 7f4990601700  1  Processor -- start
2019-08-14 11:36:21.370 7f4990601700  1 --  start start
2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
2019-08-14 11:36:21.370 7f4990601700  1 --2-  >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect
2019-08-14 11:36:21.370 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0
2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.249:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0
2019-08-14 11:36:21.370 7f4990601700  1 --  --> v2:10.0.1.250:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.370 7f498dbc6700  1 --2-  >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am v2:10.0.1.250:44952/0 (socket says 10.0.1.250:44952)
2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 learned_addr learned my addr 10.0.1.250:0/2021657657 (peer_addr_for_me v2:10.0.1.250:0/0)
2019-08-14 11:36:21.370 7f498ebc8700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down
2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 >> v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.370 7f498dbc6700  1 -- 10.0.1.250:0/2021657657 --> v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f498808d1b0 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498dbc6700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=10837f5a102e309b server_cookie=0 in_seq=0 out_seq=0
2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c0020d0 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c002410 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f498d3c5700  1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c002b40 con 0x7f4988168a80
2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 msgr2=0x7f4988169ae0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 shutdown_connections
2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700  1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 shutdown_connections
2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 wait complete.
2019-08-14 11:36:21.370 7f4990601700  1 -- 10.0.1.250:0/2021657657 >> 10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 s=STATE_NONE l=0).mark_down
2019-08-14 11:36:21.374 7f4990601700  1  Processor -- start
2019-08-14 11:36:21.374 7f4990601700  1 --  start start
2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.249:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030
2019-08-14 11:36:21.374 7f4990601700  1 --  --> v1:10.0.1.251:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0
2019-08-14 11:36:21.374 7f498ebc8700  1 --1-  >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 says I am v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466)
2019-08-14 11:36:21.374 7f498ebc8700  1 -- 10.0.1.250:0/40211647 learned_addr learned my addr 10.0.1.250:0/40211647 (peer_addr_for_me v1:10.0.1.250:0/0)
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 935+0+0 (unknown 1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 1022418838 0 0) 0x7f497c0024e0 con 0x7f49880a8360
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c530 con 0x7f496c009b80
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c6e0 con 0x7f496c004f10
2019-08-14 11:36:21.374 7f497b7fe700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c890 con 0x7f496c003ed0
2019-08-14 11:36:21.374 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.374 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.374 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 msgr2=0x7f496c0049c0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 msgr2=0x7f496c00bfe0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:36:21.378 7f498dbc6700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f497c008180 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 -- 0x7f49880b47e0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498ebc8700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).handle_auth_reply_more state changed!
2019-08-14 11:36:21.378 7f498ebc8700  1 -- 10.0.1.250:0/40211647 reap_dead start
2019-08-14 11:36:21.378 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 -- 0x7f49880a8360 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498dbc6700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f49880b47e0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c0010a0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c001490 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 0x7f497c001690 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f497b7fe700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-08-14 11:36:21.378 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) v4 ==== 106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10
2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0
2019-08-14 11:36:21.378 7f498e3c7700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready entity=mgr.1731601147 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0
2019-08-14 11:36:21.382 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10
2019-08-14 11:36:21.390 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+101614 (crc 0 0 0) 0x7f497c000e90 con 0x7f496c004f10
2019-08-14 11:36:21.586 7f4990601700  1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: {"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}) v1 -- 0x7f4968002150 con 0x7f496c028790
2019-08-14 11:37:40.274 7f497b7fe700  1 -- 10.0.1.250:0/40211647 <== mgr.1731601147 v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 ==== 8+0+47 (crc 0 0 0) 0x7f4968002150 con 0x7f496c028790
pool "$POOL" score 0.009701 (lower is better)
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 msgr2=0x7f496c005360 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down
2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 shutdown_connections
2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700  1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 shutdown_connections
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 wait complete.
2019-08-14 11:37:40.306 7f4990601700  1 -- 10.0.1.250:0/40211647 >> 10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 s=STATE_NONE l=0).mark_down

real    1m19.023s
user    0m0.288s
sys     0m0.050s

While the mgr didn't die this time, it did take about 80 seconds to eval a pool that typically takes roughly 2-5 seconds; note the gap between the command being dispatched at 11:36:21.586 and the reply arriving at 11:37:40.274.
And while the command was running, the ceph-mgr process was sitting at around 100% CPU in top. The mgr's own log over the same window shows:

2019-08-14 11:36:20.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:36:20.734 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
2019-08-14 11:36:21.586 7fbd77570700  0 log_channel(audit) log [DBG] : from='client.1731656635 -' entity='client.admin' cmd=[{"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}]: dispatch
2019-08-14 11:36:21.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:36:21.734 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
2019-08-14 11:36:22.734 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
[snip-repeating]
2019-08-14 11:37:38.826 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:37:38.826 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
2019-08-14 11:37:39.574 7fbd77d71700  1 mgr[balancer] Handling command: '{'prefix': 'balancer eval', 'option': '$POOL', 'target': ['mgr', '']}'
2019-08-14 11:37:40.186 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:37:40.186 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
2019-08-14 11:37:41.054 7fbd77570700  0 ms_deliver_dispatch: unhandled message 0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953
2019-08-14 11:37:41.054 7fbd8a95a700  1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory

I filtered out a bunch of messages from the mgr servicing osd pool stats requests, which I have running in a watch -n1 loop.
I also snipped a long run of repeating messages about mds.db, the standby-replay MDS; I'm not sure whether those are worth caring about.
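
In case it helps, two things that can be checked from the shell while the mgr is spinning are the per-thread CPU usage and whether the mon actually holds metadata for the standby-replay daemon (daemon names match my cluster; adjust as needed):

$ top -b -H -n1 -p $(pidof ceph-mgr)
$ ceph mds metadata db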

Hoping someone may have some ideas of where to look to try and improve the stability of the mgrs.

Thanks,

Reed
