Hello everyone,
I am running mds + mon on 3 nodes. Recently, due to increased cache pressure and the NUMA non-interleave effect, we decided to double the memory on the nodes from 32 GB to 64 GB.
We wanted to upgrade a standby node first so we could test the new memory vendor. So, without much thinking (I know better now :-| ), I initiated a shutdown of the server node hosting the standby mds and the second mon.
That triggered a mon election, and the original leader mon, mds1mgs1-la, still won leadership. However, at the same time everything became slow (high IO), and some CephFS clients couldn't reach the file
system, affecting production VMs and containers.
Eventually the mds on host mds1mgs1-la respawned itself, and the remaining mds on the mds3mgs3-la box became the active one. So what I am trying to understand is why a perfectly good mds had to respawn (i.e. why
the monitors stopped seeing it), and whether it is possible to avoid that in the future. Also, why on earth did the failover take ~3 minutes? CPU and network load were very low during the upgrade window.
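For reference, all I actually did on the standby node was an OS-level shutdown. Next time I plan to stop the Ceph daemons explicitly and sanity-check the cluster before powering off; a rough sketch of what I have in mind (assuming sysvinit-style service control on these nodes, adjust for your init system):

# confirm this node only holds a standby mds and one of the three mons
ceph mds stat
ceph quorum_status
# stop the local daemons cleanly before powering the node off
service ceph stop mds
service ceph stop mon
# verify the cluster degrades as expected (1 mon down, standby gone), then power off
ceph -s
shutdown -h now

If there is a better pre-shutdown procedure for a combined mon/mds node, I would be glad to hear it.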
Below I have posted some relevant logs and the ceph config.
Thank you, and sorry for the messed-up post.
####### mon log mds1mgs1-la ###################
2018-12-18 21:46:22.161172 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525969: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053
GB avail; 4074 B/s rd, 1958 kB/s wr, 219 op/s
2018-12-18 21:46:23.166556 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525970: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 1195 kB/s wr, 181 op/s
2018-12-18 21:46:23.585466 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2018-12-18 21:46:23.585564 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2018-12-18 21:46:24.258633 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525971: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 5199 kB/s wr, 628 op/s
2018-12-18 21:46:24.381022 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({"dumpcontents": ["summary"], "prefix": "pg dump", "format": "json-pretty"} v 0) v1
2018-12-18 21:46:24.381056 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 10.0.40.43:0/544239607' entity='client.admin' cmd=[{"dumpcontents": ["summary"], "prefix": "pg dump", "format":
"json-pretty"}]: dispatch
2018-12-18 21:46:26.418475 7feeaf2ee700 0 -- 10.0.40.43:6789/0 >> 10.0.40.44:6789/0 pipe(0x47d9000 sd=14 :25424 s=2 pgs=14249204 cs=1 l=0 c=0x3f38dc0).fault with nothing to send, going to standby
2018-12-18 21:46:30.695896 7feea3acf700 0 -- 10.0.40.43:6789/0 >> 10.0.41.34:0/983518683 pipe(0x8a01000 sd=89 :6789 s=0 pgs=0 cs=0 l=0 c=0x50935a0).accept peer addr is really 10.0.41.34:0/983518683
(socket is 10.0.41.34:59932/0)
2018-12-18 21:46:34.268583 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election
2018-12-18 21:46:34.268900 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(715) init, last seen epoch 715
2018-12-18 21:46:36.128372 7feeb1df6700 0 mon.mds1mgs1-la@0(electing).data_health(714) update_stats avail 97% total 224 GB, used 6800 MB, avail 217 GB
2018-12-18 21:46:39.269607 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2
2018-12-18 21:46:39.271499 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la
2018-12-18 21:46:39.275145 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0}
2018-12-18 21:46:39.275221 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525972: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 5262 kB/s wr, 547 op/s
2018-12-18 21:46:39.275286 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e245: 1/1/1 up {0=mds1mgs1-la=up:active}, 2 up:standby
2018-12-18 21:46:39.286476 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21858: 54 osds: 54 up, 54 in
2018-12-18 21:46:39.296721 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1
2018-12-18 21:46:39.296910 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 10.0.40.12:0/3145148323' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,,
,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2018-12-18 21:46:40.291878 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525973: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 6165 B/s rd, 2711 kB/s wr, 168 op/s
2018-12-18 21:46:41.288438 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525974: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 6197 B/s rd, 2683 kB/s wr, 171 op/s
2018-12-18 21:46:42.295113 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525975: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 1319 kB/s wr, 140 op/s
2018-12-18 21:46:43.302902 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525976: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 883 kB/s wr, 98 op/s
2018-12-18 21:46:43.623445 7feeb40da700 1 mon.mds1mgs1-la@0(leader).osd e21859 e21859: 54 osds: 54 up, 54 in
2018-12-18 21:46:43.624693 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in
2018-12-18 21:46:43.632999 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e246 print_map
epoch 246
flags 0
created 2016-01-10 23:27:39.568443
modified 2018-12-18 21:46:43.609035
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 246
last_failure_osd_epoch 21859
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in 0
up {0=51598182}
failed
stopped
data_pools 12
metadata_pool 13
inline_data disabled
51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:replay seq 1
2018-12-18 21:46:43.633577 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e246: 1/1/1 up {0=mds3mgs3-la=up:replay}
2018-12-18 21:46:43.633679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525977: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 12285 B/s rd, 614 kB/s wr, 63 op/s
2018-12-18 21:46:44.656204 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525978: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053
GB avail; 60392 kB/s rd, 666 kB/s wr, 1374 op/s
2018-12-18 21:46:45.688973 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525979: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 105 MB/s rd, 200
kB/s wr, 1920 op/s
2018-12-18 21:46:46.731680 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e247 print_map
epoch 247
flags 0
created 2016-01-10 23:27:39.568443
modified 2018-12-18 21:46:46.707496
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 246
last_failure_osd_epoch 21859
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in 0
up {0=51598182}
failed
stopped
data_pools 12
metadata_pool 13
inline_data disabled
51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381
2018-12-18 21:46:46.732969 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525980: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 110 MB/s rd, 607
kB/s wr, 1878 op/s
2018-12-18 21:46:46.733022 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:reconnect
2018-12-18 21:46:46.733057 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e247: 1/1/1 up {0=mds3mgs3-la=up:reconnect}
2018-12-18 21:46:47.753771 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525981: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 69803 kB/s rd, 1016
kB/s wr, 1678 op/s
2018-12-18 21:46:48.778505 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525982: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28116 kB/s rd, 774
kB/s wr, 1184 op/s
2018-12-18 21:46:49.776763 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525983: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 4297 kB/s rd, 2518
kB/s wr, 447 op/s
2018-12-18 21:46:50.778331 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525984: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 3006 kB/s
wr, 129 op/s
2018-12-18 21:46:51.788283 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525985: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29185 kB/s rd, 2460
kB/s wr, 398 op/s
2018-12-18 21:46:51.791274 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e248 print_map
epoch 248
flags 0
created 2016-01-10 23:27:39.568443
modified 2018-12-18 21:46:51.782581
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 246
last_failure_osd_epoch 21859
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in 0
up {0=51598182}
failed
stopped
data_pools 12
metadata_pool 13
inline_data disabled
60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1
51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381
2018-12-18 21:46:51.791943 7feeb40da700 0 log_channel(cluster) log [INF] : mds.? 10.0.40.43:6802/4233 up:boot
2018-12-18 21:46:51.791977 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e248: 1/1/1 up {0=mds3mgs3-la=up:reconnect}, 1 up:standby
2018-12-18 21:46:52.115683 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1
2018-12-18 21:46:52.115991 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 10.0.40.11:0/4016495961' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,,
,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch
2018-12-18 21:46:52.796679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525986: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29097 kB/s rd, 15577
kB/s wr, 591 op/s
2018-12-18 21:46:53.804900 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525987: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 62900 B/s rd, 15174
kB/s wr, 444 op/s
2018-12-18 21:46:54.813038 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525988: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5529 kB/s rd, 3437
kB/s wr, 447 op/s
2018-12-18 21:46:55.821929 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525989: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5702 kB/s rd, 21137
kB/s wr, 2534 op/s
2018-12-18 21:46:56.823179 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525990: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 244 kB/s rd, 19310
kB/s wr, 2344 op/s
2018-12-18 21:46:57.830129 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525991: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 42795 B/s rd, 477
kB/s wr, 75 op/s
2018-12-18 21:46:58.833844 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525992: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8152 B/s rd, 494
kB/s wr, 62 op/s
2018-12-18 21:46:59.838118 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525993: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8161 B/s rd, 6026
kB/s wr, 543 op/s
2018-12-18 21:47:00.840107 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525994: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 6318 kB/s
wr, 531 op/s
2018-12-18 21:47:01.843219 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525995: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 1047 kB/s
wr, 93 op/s
2018-12-18 21:47:02.849617 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525996: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 272 kB/s rd, 5716
kB/s wr, 168 op/s
2018-12-18 21:47:03.858165 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525997: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 299 kB/s rd, 5737
kB/s wr, 155 op/s
2018-12-18 21:47:04.861351 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525998: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28513 B/s rd, 6531
kB/s wr, 558 op/s
2018-12-18 21:47:05.868278 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525999: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 6701
kB/s wr, 545 op/s
2018-12-18 21:47:06.872694 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526000: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 989
kB/s wr, 99 op/s
2018-12-18 21:47:07.876537 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526001: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 12235 B/s rd, 1400
kB/s wr, 115 op/s
2018-12-18 21:47:08.881359 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526002: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 16316 B/s rd, 1728
kB/s wr, 143 op/s
2018-12-18 21:47:09.884708 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526003: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 6121 B/s rd, 5107
kB/s wr, 633 op/s
2018-12-18 21:47:10.892664 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526004: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8143 B/s rd, 5238
kB/s wr, 627 op/s
2018-12-18 21:47:11.896609 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526005: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 18325 B/s rd, 1464
kB/s wr, 177 op/s
2018-12-18 21:47:12.902995 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526006: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 36657 B/s rd, 4254
kB/s wr, 171 op/s
2018-12-18 21:47:12.905463 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e249 print_map
epoch 249
flags 0
created 2016-01-10 23:27:39.568443
modified 2018-12-18 21:47:12.900465
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 246
last_failure_osd_epoch 21859
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in 0
up {0=51598182}
failed
stopped
data_pools 12
metadata_pool 13
inline_data disabled
60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1
51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:rejoin seq 5081388
...
2018-12-18 21:49:15.836800 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526128: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053
GB avail; 0 B/s rd, 5510 kB/s wr, 598 op/s
2018-12-18 21:49:16.842000 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e250 print_map
epoch 250
flags 0
created 2016-01-10 23:27:39.568443
modified 2018-12-18 21:49:16.837045
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
last_failure 246
last_failure_osd_epoch 21859
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in
omap,8=no anchor table}
max_mds 1
in 0
up {0=51598182}
failed
stopped
data_pools 12
metadata_pool 13
inline_data disabled
60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1
51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:active seq 5081420
2018-12-18 21:49:16.842643 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526129: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053
GB avail; 0 B/s rd, 1145 kB/s wr, 76 op/s
2018-12-18 21:49:16.842685 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:active
2018-12-18 21:49:16.842709 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby
...
2018-12-18 21:51:36.129356 7feeb1df6700 0 mon.mds1mgs1-la@0(leader).data_health(716) update_stats avail 97% total 224 GB, used 6801 MB, avail 217 GB
2018-12-18 21:51:41.669590 7feeb1df6700 1 mon.mds1mgs1-la@0(leader).paxos(paxos updating c 225352570..225353159) accept timeout, calling fresh election
2018-12-18 21:51:44.947037 7feeb15f5700 1 mon.mds1mgs1-la@0(probing).data_health(716) service_dispatch not in quorum -- drop message
2018-12-18 21:51:44.947167 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election
2018-12-18 21:51:44.947266 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(716) init, last seen epoch 716
2018-12-18 21:51:49.947965 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2
2018-12-18 21:51:49.949429 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la
2018-12-18 21:51:49.952545 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0}
2018-12-18 21:51:49.952597 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526263: 5120 pgs: 2 active+clean+scrubbing+deep, 5118 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053
GB avail; 0 B/s rd, 1774 kB/s wr, 126 op/s
2018-12-18 21:51:49.952648 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby
2018-12-18 21:51:49.952737 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in
###### ceph.conf #######
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
public_network = 10.0.40.0/23
cluster_network = 10.0.42.0/23
mon_host = 10.0.40.43,10.0.40.44,10.0.40.45
mon_initial_members = mds1mgs1-la, mds2mgs2-la, mds3mgs3-la
fsid = 96e9619a-4828-4700-989a-fcf152286758
; Disabled debug 04.12.2015
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
[osd]
journal_dio = true
journal_aio = true
osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal
osd_journal_size = 2048 ; journal size, in megabytes
osd crush update on start = false
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd_op_threads = 5
osd_disk_threads = 4
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_crush_chooseleaf_type = 1
; osd pool_default_crush_rule = 1
; new options 04.12.2015
filestore_op_threads = 4
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_fiemap = true
; Reduce impact of scrub (needs cfq on osds)
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7
osd_deep_scrub_interval = 1211600
osd_scrub_begin_hour = 19
osd_scrub_end_hour = 4
osd_scrub_sleep = 0.1
[client]
rbd_cache = true
rbd_cache_size = 67108864
rbd_cache_max_dirty = 50331648
rbd_cache_target_dirty = 33554432
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = true
[mds]
mds_data = /var/lib/ceph/mds/mds.$id
keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring
mds_cache_size = 4000000
[mds.mds1mgs1-la]
host = mds1mgs1-la
[mds.mds2mgs2-la]
host = mds2mgs2-la
[mds.mds3mgs3-la]
host = mds3mgs3-la
[mon.mds1mgs1-la]
host = mds1mgs1-la
mon_addr = 10.0.40.43:6789
[mon.mds2mgs2-la]
host = mds2mgs2-la
mon_addr = 10.0.40.44:6789
[mon.mds3mgs3-la]
host = mds3mgs3-la
mon_addr = 10.0.40.45:6789
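Related to the failover-time question above: I am also wondering whether configuring one of the standby daemons as standby-replay would have shortened the replay/reconnect/rejoin phases. I have not tried it on this cluster, but my understanding is it would look roughly like this in the [mds.*] sections, e.g. for the mds2mgs2-la daemon (treat this as an untested sketch):

[mds.mds2mgs2-la]
host = mds2mgs2-la
mds standby replay = true
mds standby for rank = 0

If anyone has experience with whether standby-replay actually helps on a setup like this, I would appreciate hearing it.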
##### MDS Log mds1mgs1-la ######
2018-12-18 21:46:26.417766 7f8b567bc700 0 monclient: hunting for new mon
2018-12-18 21:46:43.627767 7f8b4b45d700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/709483473 pipe(0x457dd800 sd=175 :6801 s=2 pgs=113031 cs=1 l=0 c=0x4264780).fault with nothing to send, going to standby
2018-12-18 21:46:43.627847 7f8b47922700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.35:0/2633350234 pipe(0x4658d800 sd=203 :6801 s=2 pgs=86438 cs=1 l=0 c=0x4265ee0).fault with nothing to send, going to standby
2018-12-18 21:46:43.627788 7f8b4af58700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/374692697 pipe(0x13862000 sd=179 :6801 s=2 pgs=6 cs=1 l=0 c=0x4264200).fault with nothing to send, going to standby
2018-12-18 21:46:43.627862 7f8b49942700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.11:0/3711089666 pipe(0x45d4c800 sd=187 :6801 s=2 pgs=140 cs=1 l=0 c=0x4265d80).fault with nothing to send, going to standby
2018-12-18 21:46:43.627790 7f8b4b059700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/4267973111 pipe(0x13850000 sd=178 :6801 s=2 pgs=23145 cs=1 l=0 c=0x4264360).fault with nothing to send, going to standby
2018-12-18 21:46:43.627925 7f8b4a851700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1267883720 pipe(0x74ec800 sd=182 :6801 s=2 pgs=16223 cs=1 l=0 c=0x4264fc0).fault with nothing to send, going to standby
2018-12-18 21:46:43.627886 7f8b4b25b700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/1116724829 pipe(0x13859000 sd=176 :6801 s=2 pgs=112792 cs=1 l=0 c=0x4264620).fault with nothing to send, going to standby
2018-12-18 21:46:43.627775 7f8b4b760700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1077717237 pipe(0x457d4800 sd=172 :6801 s=2 pgs=293897 cs=1 l=0 c=0x4263c80).fault with nothing to send, going to standby
2018-12-18 21:46:43.627806 7f8b4a54e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1534369235 pipe(0x74fa000 sd=184 :6801 s=2 pgs=1222 cs=1 l=0 c=0x4264d00).fault with nothing to send, going to standby
2018-12-18 21:46:43.627987 7f8b49b44700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/2122220387 pipe(0x45d51000 sd=186 :6801 s=2 pgs=118 cs=1 l=0 c=0x4264a40).fault with nothing to send, going to standby
2018-12-18 21:46:43.627845 7f8b4a952700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3432607794 pipe(0x74f1000 sd=181 :6801 s=2 pgs=39 cs=1 l=0 c=0x4265120).fault with nothing to send, going to standby
2018-12-18 21:46:43.628013 7f8b4a750700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/609876274 pipe(0x74e8000 sd=183 :6801 s=2 pgs=336 cs=1 l=0 c=0x4264e60).fault with nothing to send, going to standby
2018-12-18 21:46:43.627798 7f8b4b15a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2666666539 pipe(0x13854800 sd=177 :6801 s=2 pgs=22227 cs=1 l=0 c=0x42644c0).fault with nothing to send, going to standby
2018-12-18 21:46:43.628023 7f8b4b65f700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/876706598 pipe(0x457d0000 sd=173 :6801 s=2 pgs=4 cs=1 l=0 c=0x4263b20).fault with nothing to send, going to standby
2018-12-18 21:46:43.628057 7f8b4b55e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/3586271069 pipe(0x457e2000 sd=174 :6801 s=2 pgs=85 cs=1 l=0 c=0x42648e0).fault with nothing to send, going to standby
2018-12-18 21:46:43.628103 7f8b4953e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1819363804 pipe(0x45d5a000 sd=189 :6801 s=2 pgs=396 cs=1 l=0 c=0x4265ac0).fault with nothing to send, going to standby
2018-12-18 21:46:43.628088 7f8b49740700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/3035404774 pipe(0x45d48000 sd=188 :6801 s=2 pgs=25316 cs=1 l=0 c=0x4265c20).fault with nothing to send, going to standby
2018-12-18 21:46:43.628214 7f8b4ad56700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3173084073 pipe(0x1385d800 sd=180 :6801 s=2 pgs=4464 cs=1 l=0 c=0x4265280).fault with nothing to send, going to standby
2018-12-18 21:46:43.633248 7f8b4d17a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2766556290 pipe(0x6e28000 sd=159 :6801 s=2 pgs=64381 cs=1 l=0 c=0x42627e0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633283 7f8b4cb74700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1370204836 pipe(0x46c15800 sd=165 :6801 s=2 pgs=188 cs=1 l=0 c=0x42639c0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633319 7f8b4cf78700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/3500267372 pipe(0x46c11000 sd=161 :6801 s=2 pgs=260230 cs=1 l=0 c=0x4262940).fault with nothing to send, going to standby
2018-12-18 21:46:43.633442 7f8b4c66f700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/4098901061 pipe(0x2903d800 sd=170 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263f40).fault with nothing to send, going to standby
2018-12-18 21:46:43.633400 7f8b4933c700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1326737308 pipe(0x45d55800 sd=190 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265960).fault with nothing to send, going to standby
2018-12-18 21:46:43.633423 7f8b4ca73700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/453871943 pipe(0x29039000 sd=166 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263860).fault with nothing to send, going to standby
2018-12-18 21:46:43.633319 7f8b4cd76700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2687385587 pipe(0x46c08000 sd=163 :6801 s=2 pgs=22601 cs=1 l=0 c=0x4263180).fault with nothing to send, going to standby
2018-12-18 21:46:43.633514 7f8b4c871700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/1782172062 pipe(0x29030000 sd=168 :6801 s=2 pgs=5 cs=1 l=0 c=0x42635a0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633524 7f8b4ce77700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2869272101 pipe(0x46c0c800 sd=162 :6801 s=2 pgs=4 cs=1 l=0 c=0x42632e0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633531 7f8b48f38700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2907939723 pipe(0x45d7a000 sd=194 :6801 s=2 pgs=80 cs=1 l=0 c=0x42653e0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633539 7f8b4913a700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/3324654654 pipe(0x45d6c800 sd=192 :6801 s=2 pgs=5 cs=1 l=0 c=0x42656a0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633550 7f8b4c36c700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1001337686 pipe(0x457d9000 sd=171 :6801 s=2 pgs=214 cs=1 l=0 c=0x4263de0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633466 7f8b4c770700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/2783726776 pipe(0x29042000 sd=169 :6801 s=2 pgs=5 cs=1 l=0 c=0x42640a0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633588 7f8b47720700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.28:0/804580006 pipe(0x45418000 sd=204 :6801 s=2 pgs=1211 cs=1 l=0 c=0x42677a0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633444 7f8b4cc75700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2140342456 pipe(0x46c1a000 sd=164 :6801 s=2 pgs=295850 cs=1 l=0 c=0x4263020).fault with nothing to send, going to standby
2018-12-18 21:46:43.633625 7f8b48a33700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.13:0/579849542 pipe(0x46589000 sd=199 :6801 s=2 pgs=113 cs=1 l=0 c=0x4266460).fault with nothing to send, going to standby
2018-12-18 21:46:43.633560 7f8b4c972700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/508560318 pipe(0x29034800 sd=167 :6801 s=2 pgs=3096 cs=1 l=0 c=0x4263700).fault with nothing to send, going to standby
2018-12-18 21:46:43.633595 7f8b4d079700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3393904046 pipe(0x6e35800 sd=160 :6801 s=2 pgs=259566 cs=1 l=0 c=0x4262aa0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633661 7f8b48c35700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/3685570052 pipe(0x45425800 sd=197 :6801 s=2 pgs=4 cs=1 l=0 c=0x4266720).fault with nothing to send, going to standby
2018-12-18 21:46:43.633661 7f8b48e37700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/1377539271 pipe(0x45d75800 sd=195 :6801 s=2 pgs=62322 cs=1 l=0 c=0x42669e0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633684 7f8b4a44d700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.12:0/2906542395 pipe(0x74f5800 sd=185 :6801 s=2 pgs=4250 cs=1 l=0 c=0x4264ba0).fault with nothing to send, going to standby
2018-12-18 21:46:43.633912 7f8b47f28700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1412785410 pipe(0x46584800 sd=200 :6801 s=2 pgs=3909776 cs=1 l=0 c=0x4266300).fault with nothing to send, going to
standby
2018-12-18 21:46:43.634010 7f8b47e27700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1924210955 pipe(0x46580000 sd=201 :6801 s=2 pgs=703145 cs=1 l=0 c=0x42661a0).fault with nothing to send, going to standby
2018-12-18 21:46:43.635479 7f8b47b24700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.34:0/983518683 pipe(0x46592000 sd=202 :6801 s=2 pgs=4665529 cs=1 l=0 c=0x4266040).fault with nothing to send, going to standby
2018-12-18 21:46:44.464463 7f8b49039700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/381241131 pipe(0x45d68000 sd=193 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265540).fault with nothing to send, going to standby
2018-12-18 21:46:44.680590 7f8b4923b700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3850008269 pipe(0x45d71000 sd=191 :6801 s=2 pgs=40 cs=1 l=0 c=0x4265800).fault with nothing to send, going to standby
2018-12-18 21:46:46.106564 7f8b567bc700 1 mds.-1.-1 handle_mds_map i (10.0.40.43:6801/7037) dne in the mdsmap, respawning myself
2018-12-18 21:46:46.106578 7f8b567bc700 1 mds.-1.-1 respawn
2018-12-18 21:46:46.106581 7f8b567bc700 1 mds.-1.-1 e: '/usr/bin/ceph-mds'
2018-12-18 21:46:46.106599 7f8b567bc700 1 mds.-1.-1 0: '/usr/bin/ceph-mds'
2018-12-18 21:46:46.106601 7f8b567bc700 1 mds.-1.-1 1: '-i'
2018-12-18 21:46:46.106603 7f8b567bc700 1 mds.-1.-1 2: 'mds1mgs1-la'
2018-12-18 21:46:46.106604 7f8b567bc700 1 mds.-1.-1 3: '--pid-file'
2018-12-18 21:46:46.106605 7f8b567bc700 1 mds.-1.-1 4: '/var/run/ceph/mds.mds1mgs1-la.pid'
2018-12-18 21:46:46.106606 7f8b567bc700 1 mds.-1.-1 5: '-c'
2018-12-18 21:46:46.106607 7f8b567bc700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf'
2018-12-18 21:46:46.106608 7f8b567bc700 1 mds.-1.-1 7: '--cluster'
2018-12-18 21:46:46.106609 7f8b567bc700 1 mds.-1.-1 8: 'ceph'