That can happen if both a mon and an mds fail at the same time; this is a common reason to avoid co-locating the mons with the mds. It can also happen during a controlled shutdown: take the mds down first, and only take the mon down once the mds state has settled. (I don't think it should take 3 minutes for it to settle when you take both offline at the same time.) A rough command sketch of that shutdown order is appended below the quoted message.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Dec 19, 2018 at 9:45 AM Alex Litvak <alexander.v.litvak@xxxxxxxxx> wrote:
>
> Hello everyone,
>
> I am running mds + mon on 3 nodes. Recently, due to increased cache pressure and the NUMA non-interleave effect, we decided to double the memory on the nodes from 32 GB to 64 GB.
> We wanted to upgrade a standby node first, to be able to test the new memory vendor. So without much thinking (I know now :-| ), I initiated a shutdown of the server node hosting the standby mds and the second mon.
> That triggered a mon election, and the original primary mon mds1mgs1-la still won the leadership. However, at the same time everything became slow (high IO), and some cephfs clients couldn't reach the file
> system, affecting production VMs and containers.
>
> Eventually the mds on host mds1mgs1-la respawned itself and the remaining mds on the mds3mgs3-la box became the active one. So what I am trying to understand is why a perfectly good mds had to respawn (i.e. why
> the monitors stopped seeing it), and whether it is possible to avoid that in the future. Also, why on earth did the failover procedure take ~3 minutes? CPU and network load were very low during the upgrade.
>
> Below I have posted some relevant logs and the ceph config.
>
> Thank you, and sorry for the messed-up post.
>
> ####### mon log mds1mgs1-la ###################
>
> 2018-12-18 21:46:22.161172 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525969: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 4074 B/s rd, 1958 kB/s wr, 219 op/s > 2018-12-18 21:46:23.166556 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525970: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1195 kB/s wr, 181 op/s > 2018-12-18 21:46:23.585466 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch > 2018-12-18 21:46:23.585564 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished > 2018-12-18 21:46:24.258633 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525971: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 5199 kB/s wr, 628 op/s > 2018-12-18 21:46:24.381022 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({"dumpcontents": ["summary"], "prefix": "pg dump", "format": "json-pretty"} v 0) v1 > 2018-12-18 21:46:24.381056 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.?
10.0.40.43:0/544239607' entity='client.admin' cmd=[{"dumpcontents": ["summary"], "prefix": "pg dump", "format": > "json-pretty"}]: dispatch > 2018-12-18 21:46:26.418475 7feeaf2ee700 0 -- 10.0.40.43:6789/0 >> 10.0.40.44:6789/0 pipe(0x47d9000 sd=14 :25424 s=2 pgs=14249204 cs=1 l=0 c=0x3f38dc0).fault with nothing to send, going to standby > 2018-12-18 21:46:30.695896 7feea3acf700 0 -- 10.0.40.43:6789/0 >> 10.0.41.34:0/983518683 pipe(0x8a01000 sd=89 :6789 s=0 pgs=0 cs=0 l=0 c=0x50935a0).accept peer addr is really 10.0.41.34:0/983518683 > (socket is 10.0.41.34:59932/0) > 2018-12-18 21:46:34.268583 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election > 2018-12-18 21:46:34.268900 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(715) init, last seen epoch 715 > 2018-12-18 21:46:36.128372 7feeb1df6700 0 mon.mds1mgs1-la@0(electing).data_health(714) update_stats avail 97% total 224 GB, used 6800 MB, avail 217 GB > 2018-12-18 21:46:39.269607 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2 > 2018-12-18 21:46:39.271499 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la > 2018-12-18 21:46:39.275145 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0} > 2018-12-18 21:46:39.275221 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525972: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 5262 kB/s wr, 547 op/s > 2018-12-18 21:46:39.275286 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e245: 1/1/1 up {0=mds1mgs1-la=up:active}, 2 up:standby > 2018-12-18 21:46:39.286476 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21858: 54 osds: 54 up, 54 in > 2018-12-18 21:46:39.296721 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1 > 2018-12-18 21:46:39.296910 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 
10.0.40.12:0/3145148323' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, > ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch > 2018-12-18 21:46:40.291878 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525973: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 6165 B/s rd, 2711 kB/s wr, 168 op/s > 2018-12-18 21:46:41.288438 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525974: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 6197 B/s rd, 2683 kB/s wr, 171 op/s > 2018-12-18 21:46:42.295113 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525975: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1319 kB/s wr, 140 op/s > 2018-12-18 21:46:43.302902 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525976: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 883 kB/s wr, 98 op/s > 2018-12-18 21:46:43.623445 7feeb40da700 1 mon.mds1mgs1-la@0(leader).osd e21859 e21859: 54 osds: 54 up, 54 in > 2018-12-18 21:46:43.624693 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in > 2018-12-18 21:46:43.632999 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e246 print_map > epoch 246 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:43.609035 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:replay seq 1 > > 2018-12-18 21:46:43.633577 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e246: 1/1/1 up {0=mds3mgs3-la=up:replay} > 2018-12-18 21:46:43.633679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525977: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 12285 B/s rd, 614 kB/s wr, 63 op/s > 2018-12-18 21:46:44.656204 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525978: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 60392 kB/s rd, 666 kB/s wr, 1374 op/s > 2018-12-18 21:46:45.688973 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525979: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 105 MB/s rd, 200 > kB/s wr, 1920 op/s > 2018-12-18 21:46:46.731680 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e247 print_map > epoch 247 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:46.707496 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > 
data_pools 12 > metadata_pool 13 > inline_data disabled > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381 > > 2018-12-18 21:46:46.732969 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525980: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 110 MB/s rd, 607 > kB/s wr, 1878 op/s > 2018-12-18 21:46:46.733022 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:reconnect > 2018-12-18 21:46:46.733057 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e247: 1/1/1 up {0=mds3mgs3-la=up:reconnect} > 2018-12-18 21:46:47.753771 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525981: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 69803 kB/s rd, 1016 > kB/s wr, 1678 op/s > 2018-12-18 21:46:48.778505 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525982: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28116 kB/s rd, 774 > kB/s wr, 1184 op/s > 2018-12-18 21:46:49.776763 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525983: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 4297 kB/s rd, 2518 > kB/s wr, 447 op/s > 2018-12-18 21:46:50.778331 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525984: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 3006 kB/s > wr, 129 op/s > 2018-12-18 21:46:51.788283 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525985: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29185 kB/s rd, 2460 > kB/s wr, 398 op/s > 2018-12-18 21:46:51.791274 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e248 print_map > epoch 248 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:51.782581 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381 > 2018-12-18 21:46:51.791943 7feeb40da700 0 log_channel(cluster) log [INF] : mds.? 10.0.40.43:6802/4233 up:boot > 2018-12-18 21:46:51.791977 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e248: 1/1/1 up {0=mds3mgs3-la=up:reconnect}, 1 up:standby > 2018-12-18 21:46:52.115683 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1 > 2018-12-18 21:46:52.115991 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 
10.0.40.11:0/4016495961' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, > ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch > 2018-12-18 21:46:52.796679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525986: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29097 kB/s rd, 15577 > kB/s wr, 591 op/s > 2018-12-18 21:46:53.804900 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525987: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 62900 B/s rd, 15174 > kB/s wr, 444 op/s > 2018-12-18 21:46:54.813038 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525988: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5529 kB/s rd, 3437 > kB/s wr, 447 op/s > 2018-12-18 21:46:55.821929 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525989: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5702 kB/s rd, 21137 > kB/s wr, 2534 op/s > 2018-12-18 21:46:56.823179 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525990: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 244 kB/s rd, 19310 > kB/s wr, 2344 op/s > 2018-12-18 21:46:57.830129 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525991: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 42795 B/s rd, 477 > kB/s wr, 75 op/s > 2018-12-18 21:46:58.833844 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525992: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8152 B/s rd, 494 > kB/s wr, 62 op/s > 2018-12-18 21:46:59.838118 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525993: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8161 B/s rd, 6026 > kB/s wr, 543 op/s > 2018-12-18 21:47:00.840107 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525994: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 6318 kB/s > wr, 531 op/s > 2018-12-18 21:47:01.843219 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525995: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 1047 kB/s > wr, 93 op/s > 2018-12-18 21:47:02.849617 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525996: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 272 kB/s rd, 5716 > kB/s wr, 168 op/s > 2018-12-18 21:47:03.858165 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525997: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 299 kB/s rd, 5737 > kB/s wr, 155 op/s > 2018-12-18 21:47:04.861351 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525998: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28513 B/s rd, 6531 > kB/s wr, 558 op/s > 2018-12-18 21:47:05.868278 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525999: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 6701 > kB/s wr, 545 op/s > 2018-12-18 21:47:06.872694 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526000: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 989 > kB/s wr, 99 op/s > 2018-12-18 21:47:07.876537 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526001: 5120 pgs: 5120 active+clean; 16236 GB data, 
48708 GB used, 38345 GB / 87053 GB avail; 12235 B/s rd, 1400 > kB/s wr, 115 op/s > 2018-12-18 21:47:08.881359 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526002: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 16316 B/s rd, 1728 > kB/s wr, 143 op/s > 2018-12-18 21:47:09.884708 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526003: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 6121 B/s rd, 5107 > kB/s wr, 633 op/s > 2018-12-18 21:47:10.892664 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526004: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8143 B/s rd, 5238 > kB/s wr, 627 op/s > 2018-12-18 21:47:11.896609 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526005: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 18325 B/s rd, 1464 > kB/s wr, 177 op/s > 2018-12-18 21:47:12.902995 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526006: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 36657 B/s rd, 4254 > kB/s wr, 171 op/s > 2018-12-18 21:47:12.905463 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e249 print_map > epoch 249 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:47:12.900465 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:rejoin seq 5081388 > ... 
> 2018-12-18 21:49:15.836800 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526128: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053 > GB avail; 0 B/s rd, 5510 kB/s wr, 598 op/s > 2018-12-18 21:49:16.842000 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e250 print_map > epoch 250 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:49:16.837045 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:active seq 5081420 > > 2018-12-18 21:49:16.842643 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526129: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053 > GB avail; 0 B/s rd, 1145 kB/s wr, 76 op/s > 2018-12-18 21:49:16.842685 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:active > 2018-12-18 21:49:16.842709 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby > ... > > 2018-12-18 21:51:36.129356 7feeb1df6700 0 mon.mds1mgs1-la@0(leader).data_health(716) update_stats avail 97% total 224 GB, used 6801 MB, avail 217 GB > 2018-12-18 21:51:41.669590 7feeb1df6700 1 mon.mds1mgs1-la@0(leader).paxos(paxos updating c 225352570..225353159) accept timeout, calling fresh election > 2018-12-18 21:51:44.947037 7feeb15f5700 1 mon.mds1mgs1-la@0(probing).data_health(716) service_dispatch not in quorum -- drop message > 2018-12-18 21:51:44.947167 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election > 2018-12-18 21:51:44.947266 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(716) init, last seen epoch 716 > 2018-12-18 21:51:49.947965 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2 > 2018-12-18 21:51:49.949429 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la > 2018-12-18 21:51:49.952545 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0} > 2018-12-18 21:51:49.952597 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526263: 5120 pgs: 2 active+clean+scrubbing+deep, 5118 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1774 kB/s wr, 126 op/s > 2018-12-18 21:51:49.952648 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby > 2018-12-18 21:51:49.952737 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in > > > > ###### ceph.conf ####### > [global] > auth_service_required = cephx > filestore_xattr_use_omap = true > auth_client_required = cephx > auth_cluster_required = cephx > public_network = 10.0.40.0/23 > cluster_network = 10.0.42.0/23 > mon_host = 10.0.40.43,10.0.40.44,10.0.40.45 > mon_initial_members = mds1mgs1-la, mds2mgs2-la, 
mds3mgs3-la > fsid = 96e9619a-4828-4700-989a-fcf152286758 > ; Disabled debug 04.12.2015 > debug lockdep = 0/0 > debug context = 0/0 > debug crush = 0/0 > debug buffer = 0/0 > debug timer = 0/0 > debug journaler = 0/0 > debug osd = 0/0 > debug optracker = 0/0 > debug objclass = 0/0 > debug filestore = 0/0 > debug journal = 0/0 > debug ms = 0/0 > debug monc = 0/0 > debug tp = 0/0 > debug auth = 0/0 > debug finisher = 0/0 > debug heartbeatmap = 0/0 > debug perfcounter = 0/0 > debug asok = 0/0 > debug throttle = 0/0 > > [osd] > journal_dio = true > journal_aio = true > osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal > osd_journal_size = 2048 ; journal size, in megabytes > osd crush update on start = false > osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M" > osd_op_threads = 5 > osd_disk_threads = 4 > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 512 > osd_pool_default_pgp_num = 512 > osd_crush_chooseleaf_type = 1 > ; osd pool_default_crush_rule = 1 > ; new options 04.12.2015 > filestore_op_threads = 4 > osd_op_num_threads_per_shard = 1 > osd_op_num_shards = 25 > filestore_fd_cache_size = 64 > filestore_fd_cache_shards = 32 > filestore_fiemap = true > ; Reduce impact of scrub (needs cfq on osds) > osd_disk_thread_ioprio_class = "idle" > osd_disk_thread_ioprio_priority = 7 > osd_deep_scrub_interval = 1211600 > osd_scrub_begin_hour = 19 > osd_scrub_end_hour = 4 > osd_scrub_sleep = 0.1 > [client] > rbd_cache = true > rbd_cache_size = 67108864 > rbd_cache_max_dirty = 50331648 > rbd_cache_target_dirty = 33554432 > rbd_cache_max_dirty_age = 2 > rbd_cache_writethrough_until_flush = true > > [mds] > mds_data = /var/lib/ceph/mds/mds.$id > keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring > mds_cache_size = 4000000 > [mds.mds1mgs1-la] > host = mds1mgs1-la > [mds.mds2mgs2-la] > host = mds2mgs2-la > [mds.mds3mgs3-la] > host = mds3mgs3-la > > [mon.mds1mgs1-la] > host = mds1mgs1-la > mon_addr = 10.0.40.43:6789 > [mon.mds2mgs2-la] > host = mds2mgs2-la > mon_addr = 10.0.40.44:6789 > [mon.mds3mgs3-la] > host = mds3mgs3-la > mon_addr = 10.0.40.45:6789 > > ##### MDS Log mds1mgs1-la ###### > 2018-12-18 21:46:26.417766 7f8b567bc700 0 monclient: hunting for new mon > 2018-12-18 21:46:43.627767 7f8b4b45d700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/709483473 pipe(0x457dd800 sd=175 :6801 s=2 pgs=113031 cs=1 l=0 c=0x4264780).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627847 7f8b47922700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.35:0/2633350234 pipe(0x4658d800 sd=203 :6801 s=2 pgs=86438 cs=1 l=0 c=0x4265ee0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627788 7f8b4af58700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/374692697 pipe(0x13862000 sd=179 :6801 s=2 pgs=6 cs=1 l=0 c=0x4264200).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627862 7f8b49942700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.11:0/3711089666 pipe(0x45d4c800 sd=187 :6801 s=2 pgs=140 cs=1 l=0 c=0x4265d80).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627790 7f8b4b059700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/4267973111 pipe(0x13850000 sd=178 :6801 s=2 pgs=23145 cs=1 l=0 c=0x4264360).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627925 7f8b4a851700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1267883720 pipe(0x74ec800 sd=182 :6801 s=2 pgs=16223 cs=1 l=0 c=0x4264fc0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627886 7f8b4b25b700 0 -- 10.0.40.43:6801/7037 
>> 10.0.40.39:0/1116724829 pipe(0x13859000 sd=176 :6801 s=2 pgs=112792 cs=1 l=0 c=0x4264620).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627775 7f8b4b760700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1077717237 pipe(0x457d4800 sd=172 :6801 s=2 pgs=293897 cs=1 l=0 c=0x4263c80).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627806 7f8b4a54e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1534369235 pipe(0x74fa000 sd=184 :6801 s=2 pgs=1222 cs=1 l=0 c=0x4264d00).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627987 7f8b49b44700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/2122220387 pipe(0x45d51000 sd=186 :6801 s=2 pgs=118 cs=1 l=0 c=0x4264a40).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627845 7f8b4a952700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3432607794 pipe(0x74f1000 sd=181 :6801 s=2 pgs=39 cs=1 l=0 c=0x4265120).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628013 7f8b4a750700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/609876274 pipe(0x74e8000 sd=183 :6801 s=2 pgs=336 cs=1 l=0 c=0x4264e60).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627798 7f8b4b15a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2666666539 pipe(0x13854800 sd=177 :6801 s=2 pgs=22227 cs=1 l=0 c=0x42644c0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628023 7f8b4b65f700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/876706598 pipe(0x457d0000 sd=173 :6801 s=2 pgs=4 cs=1 l=0 c=0x4263b20).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628057 7f8b4b55e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/3586271069 pipe(0x457e2000 sd=174 :6801 s=2 pgs=85 cs=1 l=0 c=0x42648e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628103 7f8b4953e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1819363804 pipe(0x45d5a000 sd=189 :6801 s=2 pgs=396 cs=1 l=0 c=0x4265ac0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628088 7f8b49740700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/3035404774 pipe(0x45d48000 sd=188 :6801 s=2 pgs=25316 cs=1 l=0 c=0x4265c20).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628214 7f8b4ad56700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3173084073 pipe(0x1385d800 sd=180 :6801 s=2 pgs=4464 cs=1 l=0 c=0x4265280).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633248 7f8b4d17a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2766556290 pipe(0x6e28000 sd=159 :6801 s=2 pgs=64381 cs=1 l=0 c=0x42627e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633283 7f8b4cb74700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1370204836 pipe(0x46c15800 sd=165 :6801 s=2 pgs=188 cs=1 l=0 c=0x42639c0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633319 7f8b4cf78700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/3500267372 pipe(0x46c11000 sd=161 :6801 s=2 pgs=260230 cs=1 l=0 c=0x4262940).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633442 7f8b4c66f700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/4098901061 pipe(0x2903d800 sd=170 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263f40).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633400 7f8b4933c700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1326737308 pipe(0x45d55800 sd=190 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265960).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633423 7f8b4ca73700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/453871943 pipe(0x29039000 sd=166 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263860).fault with 
nothing to send, going to standby > 2018-12-18 21:46:43.633319 7f8b4cd76700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2687385587 pipe(0x46c08000 sd=163 :6801 s=2 pgs=22601 cs=1 l=0 c=0x4263180).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633514 7f8b4c871700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/1782172062 pipe(0x29030000 sd=168 :6801 s=2 pgs=5 cs=1 l=0 c=0x42635a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633524 7f8b4ce77700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2869272101 pipe(0x46c0c800 sd=162 :6801 s=2 pgs=4 cs=1 l=0 c=0x42632e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633531 7f8b48f38700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2907939723 pipe(0x45d7a000 sd=194 :6801 s=2 pgs=80 cs=1 l=0 c=0x42653e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633539 7f8b4913a700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/3324654654 pipe(0x45d6c800 sd=192 :6801 s=2 pgs=5 cs=1 l=0 c=0x42656a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633550 7f8b4c36c700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1001337686 pipe(0x457d9000 sd=171 :6801 s=2 pgs=214 cs=1 l=0 c=0x4263de0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633466 7f8b4c770700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/2783726776 pipe(0x29042000 sd=169 :6801 s=2 pgs=5 cs=1 l=0 c=0x42640a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633588 7f8b47720700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.28:0/804580006 pipe(0x45418000 sd=204 :6801 s=2 pgs=1211 cs=1 l=0 c=0x42677a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633444 7f8b4cc75700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2140342456 pipe(0x46c1a000 sd=164 :6801 s=2 pgs=295850 cs=1 l=0 c=0x4263020).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633625 7f8b48a33700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.13:0/579849542 pipe(0x46589000 sd=199 :6801 s=2 pgs=113 cs=1 l=0 c=0x4266460).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633560 7f8b4c972700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/508560318 pipe(0x29034800 sd=167 :6801 s=2 pgs=3096 cs=1 l=0 c=0x4263700).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633595 7f8b4d079700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3393904046 pipe(0x6e35800 sd=160 :6801 s=2 pgs=259566 cs=1 l=0 c=0x4262aa0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633661 7f8b48c35700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/3685570052 pipe(0x45425800 sd=197 :6801 s=2 pgs=4 cs=1 l=0 c=0x4266720).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633661 7f8b48e37700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/1377539271 pipe(0x45d75800 sd=195 :6801 s=2 pgs=62322 cs=1 l=0 c=0x42669e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633684 7f8b4a44d700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.12:0/2906542395 pipe(0x74f5800 sd=185 :6801 s=2 pgs=4250 cs=1 l=0 c=0x4264ba0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633912 7f8b47f28700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1412785410 pipe(0x46584800 sd=200 :6801 s=2 pgs=3909776 cs=1 l=0 c=0x4266300).fault with nothing to send, going to > standby > 2018-12-18 21:46:43.634010 7f8b47e27700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1924210955 pipe(0x46580000 sd=201 :6801 s=2 pgs=703145 cs=1 l=0 c=0x42661a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.635479 7f8b47b24700 0 -- 
10.0.40.43:6801/7037 >> 10.0.41.34:0/983518683 pipe(0x46592000 sd=202 :6801 s=2 pgs=4665529 cs=1 l=0 c=0x4266040).fault with nothing to send, going to standby > 2018-12-18 21:46:44.464463 7f8b49039700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/381241131 pipe(0x45d68000 sd=193 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265540).fault with nothing to send, going to standby > 2018-12-18 21:46:44.680590 7f8b4923b700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3850008269 pipe(0x45d71000 sd=191 :6801 s=2 pgs=40 cs=1 l=0 c=0x4265800).fault with nothing to send, going to standby > 2018-12-18 21:46:46.106564 7f8b567bc700 1 mds.-1.-1 handle_mds_map i (10.0.40.43:6801/7037) dne in the mdsmap, respawning myself > 2018-12-18 21:46:46.106578 7f8b567bc700 1 mds.-1.-1 respawn > 2018-12-18 21:46:46.106581 7f8b567bc700 1 mds.-1.-1 e: '/usr/bin/ceph-mds' > 2018-12-18 21:46:46.106599 7f8b567bc700 1 mds.-1.-1 0: '/usr/bin/ceph-mds' > 2018-12-18 21:46:46.106601 7f8b567bc700 1 mds.-1.-1 1: '-i' > 2018-12-18 21:46:46.106603 7f8b567bc700 1 mds.-1.-1 2: 'mds1mgs1-la' > 2018-12-18 21:46:46.106604 7f8b567bc700 1 mds.-1.-1 3: '--pid-file' > 2018-12-18 21:46:46.106605 7f8b567bc700 1 mds.-1.-1 4: '/var/run/ceph/mds.mds1mgs1-la.pid' > 2018-12-18 21:46:46.106606 7f8b567bc700 1 mds.-1.-1 5: '-c' > 2018-12-18 21:46:46.106607 7f8b567bc700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf' > 2018-12-18 21:46:46.106608 7f8b567bc700 1 mds.-1.-1 7: '--cluster' > 2018-12-18 21:46:46.106609 7f8b567bc700 1 mds.-1.-1 8: 'ceph'
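
As promised above, here is a minimal sketch of the controlled-shutdown order for the standby node (mds2mgs2-la in this thread). It assumes a systemd-managed deployment, so the unit names and exact commands may differ on your release; treat it as illustrative only, not as the one true procedure.

##### controlled shutdown sketch (illustrative) #####
# 1. Stop the standby MDS on the node you are about to service.
systemctl stop ceph-mds@mds2mgs2-la    # or whatever init script your release uses

# 2. Wait for the mds map to settle: it should still show an active rank plus
#    the expected standbys, e.g. "1/1/1 up {0=mds1mgs1-la=up:active}, 1 up:standby".
ceph mds stat

# 3. Only then stop the monitor on the same node.
systemctl stop ceph-mon@mds2mgs2-la

# 4. Confirm the remaining two mons still form a quorum before powering off.
ceph quorum_status --format json-pretty
ceph -s

When the node comes back, starting the mon first and the mds afterwards keeps the same separation between the two state changes.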