I'm having a similar issue with ceph-mgr stability since upgrading from 13.2.5 to 13.2.6. I have isolated the crashing to the prometheus module being enabled, and I see much better stability when the prometheus module is NOT enabled: no more failovers. However, I do notice that even with prometheus disabled, my haproxy keepalive checks for the ceph-dashboard occasionally fail their Layer-4 checks, though not often enough to stop the ceph-dashboard from working. I had no perceived issues on 13.2.5.

I have opened a ceph bug for 13.2.6, which has been acknowledged as a duplicate of another ticket. Thankfully we also use Zabbix for monitoring, so until this is fixed I am relying on Zabbix metrics instead of the metrics in Prometheus/Grafana.

Ceph bug here: https://tracker.ceph.com/issues/41171

# ceph mgr module ls
{
    "enabled_modules": [
        "balancer",
        "crash",
        "dashboard",
        "restful",
        "status",
        "zabbix"
    ],
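
For anyone who wants to try the same workaround, it looks roughly like this from the CLI (a sketch rather than my exact shell history; the last command assumes jq is installed and that `ceph mgr dump` on your release reports active_name):

# ceph mgr module disable prometheus                       # stop the mgr from serving /metrics
# ceph mgr module ls                                       # confirm prometheus is gone from enabled_modules
# ceph mgr fail $(ceph mgr dump | jq -r '.active_name')    # optional: push a wedged active mgr over to a standby

Disabling the module only stops the exporter; re-enabling it later is just `ceph mgr module enable prometheus`.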
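
On the dashboard checks: a Layer-4 check only proves that the TCP port opens, so if you want haproxy to distinguish "port open" from "dashboard actually answering", an HTTP health check is an option. A minimal sketch of the idea (backend name, addresses, and the 8443 dashboard port are placeholders, not my real config):

backend ceph_dashboard
    balance source                                  # keep a client pinned to one dashboard instance
    option httpchk GET /
    server mgr1 192.0.2.11:8443 check check-ssl verify none
    server mgr2 192.0.2.12:8443 check check-ssl verify none
    server mgr3 192.0.2.13:8443 check check-ssl verify none

By default haproxy counts any 2xx/3xx response as a passing check, which should cope with the dashboard redirecting to its login page.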

On Wed, Aug 14, 2019 at 12:09 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:
>
> Hi all,
>
> I'm experiencing some stability issues with my ceph-mgr daemons post-Nautilus upgrade (13.2.6 -> 14.2.2).
> On top of Ubuntu 18.04, the mgrs are colocated with the mons, and 2 of the 3 mon/mgr nodes also colocate an MDS, running a single rank active:standby-replay.
>
> The main issue I am seeing is that I will have mgrs silently die, to the point where ceph -s will report "no active mgr", despite none of the mgr daemons crashing or dying. Typically they end up coming back up/in after some amount of time, but metrics stall, and mgr commands will stall/hang.
>
> $ ceph mgr module ls
> {
>     "enabled_modules": [
>         "diskprediction_local",
>         "influx",
>         "iostat",
>         "prometheus",
>         "rbd_support",
>         "restful",
>         "telemetry"
>     ],
>
> I ran the mgr with debug=20 and was unable to find anything terribly useful there, but I can try to reproduce things if that seems helpful.
> The only real non-standard bit of my mgr instances is that I cherry-picked this change to module.py to cure an issue with the influx plugin.
> Typically these stalls happen when the mgr is servicing a command, and lately that command has been the balancer.
>
> For example:
>
> $ time ceph balancer eval $POOL --debug-ms 1
> 2019-08-14 11:36:21.370 7f4990601700 1 Processor -- start > 2019-08-14 11:36:21.370 7f4990601700 1 -- start start > 2019-08-14 11:36:21.370 7f4990601700 1 --2- >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect > 2019-08-14 11:36:21.370 7f4990601700 1 --2- >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=NONE pgs=0 cs=0 l=0 rx=0 tx=0).connect > 2019-08-14 11:36:21.370 7f4990601700 1 -- --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f498808bbb0 con 0x7f49881684c0 > 2019-08-14 11:36:21.370 7f4990601700 1 -- --> v2:10.0.1.249:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808ce00 con 0x7f498816f6a0 > 2019-08-14 11:36:21.370 7f4990601700 1 -- --> v2:10.0.1.250:3300/0 -- mon_getmap magic: 0 v1 -- 0x7f498808cfb0 con 0x7f4988168a80 > 2019-08-14 11:36:21.370 7f498dbc6700 1 --2- >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.370 7f498dbc6700 1 --2- >> v2:10.0.1.250:3300/0 conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=HELLO_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).handle_hello peer v2:10.0.1.250:3300/0 says I am v2:10.0.1.250:44952/0 (socket says 10.0.1.250:44952) > 2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 learned_addr learned my addr 10.0.1.250:0/2021657657 (peer_addr_for_me v2:10.0.1.250:0/0) > 2019-08-14 11:36:21.370 7f498ebc8700 1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 msgr2=0x7f4988171b00 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=0).mark_down > 2019-08-14 11:36:21.370 7f498dbc6700 1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=0 rx=0 tx=0).stop > 2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 >> v1:10.0.1.250:6789/0 conn(0x7f49881684c0 legacy=0x7f49881692e0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.370 7f498dbc6700 1 -- 10.0.1.250:0/2021657657 --> v2:10.0.1.250:3300/0 -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f498808d1b0 con 0x7f4988168a80 > 2019-08-14 11:36:21.370 7f498dbc6700 1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=10837f5a102e309b server_cookie=0 in_seq=0 out_seq=0 > 2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c0020d0 con 0x7f4988168a80 > 2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c002410 con 0x7f4988168a80 > 2019-08-14 11:36:21.370 7f498d3c5700 1 -- 10.0.1.250:0/2021657657 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c002b40 con 0x7f4988168a80 > 2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 msgr2=0x7f4988169ae0 crc :-1 
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 crc :-1 s=READY pgs=82826 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 shutdown_connections > 2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >> v2:10.0.1.249:3300/0 conn(0x7f498816f6a0 0x7f4988171b00 unknown :-1 s=CLOSED pgs=0 cs=0 l=0 rx=0 tx=0).stop > 2019-08-14 11:36:21.370 7f4990601700 1 --2- 10.0.1.250:0/2021657657 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f4988168a80 0x7f4988169ae0 unknown :-1 s=CLOSED pgs=82826 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 shutdown_connections > 2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 wait complete. > 2019-08-14 11:36:21.370 7f4990601700 1 -- 10.0.1.250:0/2021657657 >> 10.0.1.250:0/2021657657 conn(0x7f4988163a40 msgr2=0x7f4988165ed0 unknown :-1 s=STATE_NONE l=0).mark_down > 2019-08-14 11:36:21.374 7f4990601700 1 Processor -- start > 2019-08-14 11:36:21.374 7f4990601700 1 -- start start > 2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.249:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b3f30 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.250:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4170 con 0x7f49880b0030 > 2019-08-14 11:36:21.374 7f4990601700 1 -- --> v1:10.0.1.251:6789/0 -- auth(proto 0 30 bytes epoch 0) v1 -- 0x7f49880b4420 con 0x7f49880a7da0 > 2019-08-14 11:36:21.374 7f498ebc8700 1 --1- >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 0x7f49880b2560 :-1 s=CONNECTING_WAIT_BANNER_AND_IDENTIFY pgs=0 cs=0 l=1).handle_server_banner_and_identify peer v1:10.0.1.250:6789/0 says I am v1:10.0.1.250:57466/0 (socket says 10.0.1.250:57466) > 2019-08-14 11:36:21.374 7f498ebc8700 1 -- 10.0.1.250:0/40211647 learned_addr learned my addr 10.0.1.250:0/40211647 (peer_addr_for_me v1:10.0.1.250:0/0) > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 1 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 33+0+0 (unknown 3197104487 0 0) 0x7f497c007c60 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 36 bytes epoch 0) v1 -- 0x7f496c002d30 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 2 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 935+0+0 (unknown 1101082921 0 0) 0x7f497c007f40 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 --> v1:10.0.1.249:6789/0 -- auth(proto 2 165 bytes epoch 0) v1 -- 0x7f496c003a30 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.0 v1:10.0.1.249:6789/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (unknown 1022418838 0 0) 0x7f497c0024e0 con 0x7f49880a8360 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.251:6789/0 conn(0x7f49880a7da0 legacy=0x7f49880a8c50 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.250:6789/0 conn(0x7f49880b0030 legacy=0x7f49880b2560 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 >> v1:10.0.1.249:6789/0 conn(0x7f49880a8360 legacy=0x7f49880a9450 unknown :-1 
s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect > 2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect > 2019-08-14 11:36:21.374 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c530 con 0x7f496c009b80 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c6e0 con 0x7f496c004f10 > 2019-08-14 11:36:21.374 7f497b7fe700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] -- mon_getmap magic: 0 v1 -- 0x7f496c00c890 con 0x7f496c003ed0 > 2019-08-14 11:36:21.374 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.374 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.374 7f498ebc8700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 msgr2=0x7f496c0049c0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.251:3300/0,v1:10.0.1.251:6789/0] conn(0x7f496c003ed0 0x7f496c0049c0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 msgr2=0x7f496c00bfe0 unknown :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:36:21.378 7f498dbc6700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({config=0+,monmap=0+}) v3 -- 0x7f497c008180 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f4990601700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({mgrmap=0+}) v3 -- 0x7f49880b47e0 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f498ebc8700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.249:3300/0,v1:10.0.1.249:6789/0] conn(0x7f496c009b80 0x7f496c00bfe0 unknown :-1 s=CLOSED pgs=0 cs=0 l=1 rx=0 tx=0).handle_auth_reply_more state changed! 
> 2019-08-14 11:36:21.378 7f498ebc8700 1 -- 10.0.1.250:0/40211647 reap_dead start > 2019-08-14 11:36:21.378 7f4990601700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_subscribe({osdmap=0}) v3 -- 0x7f49880a8360 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f498dbc6700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).ready entity=mon.1 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 > 2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 1 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f49880b47e0 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 2 ==== config(6 keys) v1 ==== 207+0+0 (crc 0 0 0) 0x7f497c0010a0 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 3 ==== mon_map magic: 0 v1 ==== 395+0+0 (crc 0 0 0) 0x7f497c001490 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 4 ==== mgrmap(e 2536) v1 ==== 28602+0+0 (crc 0 0 0) 0x7f497c001690 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f497b7fe700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect > 2019-08-14 11:36:21.378 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 5 ==== osd_map(172849..172849 src has 171976..172849) v4 ==== 106776+0+0 (crc 0 0 0) 0x7f497c0489d0 con 0x7f496c004f10 > 2019-08-14 11:36:21.378 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=BANNER_CONNECTING pgs=0 cs=0 l=1 rx=0 tx=0)._handle_peer_banner_payload supported=0 required=0 > 2019-08-14 11:36:21.378 7f498e3c7700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).ready entity=mgr.1731601147 client_cookie=0 server_cookie=0 in_seq=0 out_seq=0 > 2019-08-14 11:36:21.382 7f4990601700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- 0x7f49680031e0 con 0x7f496c004f10 > 2019-08-14 11:36:21.390 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mon.1 v2:10.0.1.250:3300/0 6 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0 v0) v1 ==== 72+0+101614 (crc 0 0 0) 0x7f497c000e90 con 0x7f496c004f10 > 2019-08-14 11:36:21.586 7f4990601700 1 -- 10.0.1.250:0/40211647 --> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] -- command(tid 0: {"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}) v1 -- 0x7f4968002150 con 0x7f496c028790 > 2019-08-14 11:37:40.274 7f497b7fe700 1 -- 10.0.1.250:0/40211647 <== mgr.1731601147 v2:10.0.1.250:6800/34061 1 ==== command_reply(tid 0: 0 ) v1 ==== 8+0+47 (crc 0 0 0) 0x7f4968002150 con 0x7f496c028790 > pool "$POOL" score 0.009701 (lower is better) > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 msgr2=0x7f496c02ace0 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 
0x7f496c02ace0 crc :-1 s=READY pgs=16527 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 msgr2=0x7f496c005360 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).mark_down > 2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 crc :-1 s=READY pgs=82828 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 shutdown_connections > 2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:6800/34061,v1:10.0.1.250:6801/34061] conn(0x7f496c028790 0x7f496c02ace0 unknown :-1 s=CLOSED pgs=16527 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:37:40.306 7f4990601700 1 --2- 10.0.1.250:0/40211647 >> [v2:10.0.1.250:3300/0,v1:10.0.1.250:6789/0] conn(0x7f496c004f10 0x7f496c005360 unknown :-1 s=CLOSED pgs=82828 cs=0 l=1 rx=0 tx=0).stop > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 shutdown_connections > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 wait complete. > 2019-08-14 11:37:40.306 7f4990601700 1 -- 10.0.1.250:0/40211647 >> 10.0.1.250:0/40211647 conn(0x7f498809d440 msgr2=0x7f498809d8b0 unknown :-1 s=STATE_NONE l=0).mark_down
>
> real 1m19.023s
> user 0m0.288s
> sys 0m0.050s
>
> While the mgr didn't die, it did take 80 seconds to eval a pool, which typically takes roughly 2-5 seconds to run.
> And while this command was running, the ceph-mgr process was sitting at around 100% CPU in top.
>
> 2019-08-14 11:36:20.734 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d96ee809c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > 2019-08-14 11:36:20.734 7fbd8a95a700 1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory > 2019-08-14 11:36:21.586 7fbd77570700 0 log_channel(audit) log [DBG] : from='client.1731656635 -' entity='client.admin' cmd=[{"prefix": "balancer eval", "option": "$POOL", "target": ["mgr", ""]}]: dispatch > 2019-08-14 11:36:21.734 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d968250a40 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > 2019-08-14 11:36:21.734 7fbd8a95a700 1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory > 2019-08-14 11:36:22.734 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d969297a00 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > [snip-repeating] > > 2019-08-14 11:37:38.826 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d96b768d80 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > 2019-08-14 11:37:38.826 7fbd8a95a700 1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory > 2019-08-14 11:37:39.574 7fbd77d71700 1 mgr[balancer] Handling command: '{'prefix': 'balancer eval', 'option': '$POOL', 'target': ['mgr', '']}' > > 2019-08-14 11:37:40.186 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d967aa23c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > 2019-08-14 11:37:40.186 7fbd8a95a700 1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory > 2019-08-14 11:37:41.054 7fbd77570700 0 ms_deliver_dispatch: unhandled message 0x55d96117c9c0 mgrreport(mds.db +110-0 packed 1366) v7 from mds.1717053370 v2:10.0.1.251:6800/140458953 > 2019-08-14 11:37:41.054 7fbd8a95a700 1 mgr finish mon failed to return metadata for mds.db: (2) No such file or directory
>
> I filtered out a bunch of messages from the mgr servicing the `osd pool stats` requests, which I have running in a watch -n1.
> And I also snipped a bunch of repeating messages about mds.db, the standby-replay MDS; I'm not sure if that is anything worth caring about.
>
> Hoping someone may have some ideas of where to look to try and improve stability with the mgrs.
>
> Thanks,
>
> Reed
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx