That can happen if both a mon and an mds fail at the same time; this is a common reason to avoid co-locating the mons with the mds. It can also happen during a controlled shutdown: take the mds down first, and only take the mon down once the mds state has settled. (I don't think it should take 3 minutes for it to settle when you take both offline at the same time.) A rough command sketch of that shutdown order is appended below the quoted message.

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Dec 19, 2018 at 9:45 AM Alex Litvak <alexander.v.litvak@xxxxxxxxx> wrote:
>
> Hello everyone,
>
> I am running mds + mon on 3 nodes. Recently, due to increased cache pressure and the NUMA non-interleave effect, we decided to double the memory on the nodes from 32 GB to 64 GB.
> We wanted to upgrade a standby node first, to be able to test the new memory vendor. So without much thinking (I know now :-| ), I initiated a shutdown of the server node hosting the standby mds and the second mon.
> That triggered a mon election, and the original primary mon mds1mgs1-la still won the leadership. However, at the same time everything became slow (high IO), and some cephfs clients couldn't reach the file
> system, affecting production VMs and containers.
>
> Eventually the mds on host mds1mgs1-la respawned itself and the remaining mds on the mds3mgs3-la box became the active one. So what I am trying to understand is why a perfectly good mds had to respawn (i.e. why
> the monitors stopped seeing it), and whether it is possible to avoid that in the future. Also, why on earth did the failover procedure take ~3 minutes? CPU and network load were very low during the upgrade.
>
> Below I have posted some relevant logs and the ceph config.
>
> Thank you, and sorry for the messed-up post.
>
> ####### mon log mds1mgs1-la ###################
>
> 2018-12-18 21:46:22.161172 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525969: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 4074 B/s rd, 1958 kB/s wr, 219 op/s > 2018-12-18 21:46:23.166556 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525970: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1195 kB/s wr, 181 op/s > 2018-12-18 21:46:23.585466 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch > 2018-12-18 21:46:23.585564 7feeb57aa700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished > 2018-12-18 21:46:24.258633 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525971: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 5199 kB/s wr, 628 op/s > 2018-12-18 21:46:24.381022 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({"dumpcontents": ["summary"], "prefix": "pg dump", "format": "json-pretty"} v 0) v1 > 2018-12-18 21:46:24.381056 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.?
10.0.40.43:0/544239607' entity='client.admin' cmd=[{"dumpcontents": ["summary"], "prefix": "pg dump", "format": > "json-pretty"}]: dispatch > 2018-12-18 21:46:26.418475 7feeaf2ee700 0 -- 10.0.40.43:6789/0 >> 10.0.40.44:6789/0 pipe(0x47d9000 sd=14 :25424 s=2 pgs=14249204 cs=1 l=0 c=0x3f38dc0).fault with nothing to send, going to standby > 2018-12-18 21:46:30.695896 7feea3acf700 0 -- 10.0.40.43:6789/0 >> 10.0.41.34:0/983518683 pipe(0x8a01000 sd=89 :6789 s=0 pgs=0 cs=0 l=0 c=0x50935a0).accept peer addr is really 10.0.41.34:0/983518683 > (socket is 10.0.41.34:59932/0) > 2018-12-18 21:46:34.268583 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election > 2018-12-18 21:46:34.268900 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(715) init, last seen epoch 715 > 2018-12-18 21:46:36.128372 7feeb1df6700 0 mon.mds1mgs1-la@0(electing).data_health(714) update_stats avail 97% total 224 GB, used 6800 MB, avail 217 GB > 2018-12-18 21:46:39.269607 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2 > 2018-12-18 21:46:39.271499 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la > 2018-12-18 21:46:39.275145 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0} > 2018-12-18 21:46:39.275221 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525972: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 5262 kB/s wr, 547 op/s > 2018-12-18 21:46:39.275286 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e245: 1/1/1 up {0=mds1mgs1-la=up:active}, 2 up:standby > 2018-12-18 21:46:39.286476 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21858: 54 osds: 54 up, 54 in > 2018-12-18 21:46:39.296721 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1 > 2018-12-18 21:46:39.296910 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 
10.0.40.12:0/3145148323' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, > ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch > 2018-12-18 21:46:40.291878 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525973: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 6165 B/s rd, 2711 kB/s wr, 168 op/s > 2018-12-18 21:46:41.288438 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525974: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 6197 B/s rd, 2683 kB/s wr, 171 op/s > 2018-12-18 21:46:42.295113 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525975: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1319 kB/s wr, 140 op/s > 2018-12-18 21:46:43.302902 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525976: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 883 kB/s wr, 98 op/s > 2018-12-18 21:46:43.623445 7feeb40da700 1 mon.mds1mgs1-la@0(leader).osd e21859 e21859: 54 osds: 54 up, 54 in > 2018-12-18 21:46:43.624693 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in > 2018-12-18 21:46:43.632999 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e246 print_map > epoch 246 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:43.609035 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:replay seq 1 > > 2018-12-18 21:46:43.633577 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e246: 1/1/1 up {0=mds3mgs3-la=up:replay} > 2018-12-18 21:46:43.633679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525977: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 12285 B/s rd, 614 kB/s wr, 63 op/s > 2018-12-18 21:46:44.656204 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525978: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 > GB avail; 60392 kB/s rd, 666 kB/s wr, 1374 op/s > 2018-12-18 21:46:45.688973 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525979: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 105 MB/s rd, 200 > kB/s wr, 1920 op/s > 2018-12-18 21:46:46.731680 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e247 print_map > epoch 247 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:46.707496 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > 
data_pools 12 > metadata_pool 13 > inline_data disabled > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381 > > 2018-12-18 21:46:46.732969 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525980: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 110 MB/s rd, 607 > kB/s wr, 1878 op/s > 2018-12-18 21:46:46.733022 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:reconnect > 2018-12-18 21:46:46.733057 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e247: 1/1/1 up {0=mds3mgs3-la=up:reconnect} > 2018-12-18 21:46:47.753771 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525981: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 69803 kB/s rd, 1016 > kB/s wr, 1678 op/s > 2018-12-18 21:46:48.778505 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525982: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28116 kB/s rd, 774 > kB/s wr, 1184 op/s > 2018-12-18 21:46:49.776763 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525983: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 4297 kB/s rd, 2518 > kB/s wr, 447 op/s > 2018-12-18 21:46:50.778331 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525984: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 3006 kB/s > wr, 129 op/s > 2018-12-18 21:46:51.788283 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525985: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29185 kB/s rd, 2460 > kB/s wr, 398 op/s > 2018-12-18 21:46:51.791274 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e248 print_map > epoch 248 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:46:51.782581 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:reconnect seq 5081381 > 2018-12-18 21:46:51.791943 7feeb40da700 0 log_channel(cluster) log [INF] : mds.? 10.0.40.43:6802/4233 up:boot > 2018-12-18 21:46:51.791977 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e248: 1/1/1 up {0=mds3mgs3-la=up:reconnect}, 1 up:standby > 2018-12-18 21:46:52.115683 7feeb15f5700 0 mon.mds1mgs1-la@0(leader) e2 handle_command mon_command({ " p r e f i x " : " d f " , " f o r m a t " : " j s o n " } v 0) v1 > 2018-12-18 21:46:52.115991 7feeb15f5700 0 log_channel(audit) log [DBG] : from='client.? 
10.0.40.11:0/4016495961' entity='client.cinder' cmd=[{,",p,r,e,f,i,x,",:,",d,f,",,, > ,",f,o,r,m,a,t,",:,",j,s,o,n,",}]: dispatch > 2018-12-18 21:46:52.796679 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525986: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 29097 kB/s rd, 15577 > kB/s wr, 591 op/s > 2018-12-18 21:46:53.804900 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525987: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 62900 B/s rd, 15174 > kB/s wr, 444 op/s > 2018-12-18 21:46:54.813038 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525988: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5529 kB/s rd, 3437 > kB/s wr, 447 op/s > 2018-12-18 21:46:55.821929 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525989: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 5702 kB/s rd, 21137 > kB/s wr, 2534 op/s > 2018-12-18 21:46:56.823179 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525990: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 244 kB/s rd, 19310 > kB/s wr, 2344 op/s > 2018-12-18 21:46:57.830129 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525991: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 42795 B/s rd, 477 > kB/s wr, 75 op/s > 2018-12-18 21:46:58.833844 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525992: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8152 B/s rd, 494 > kB/s wr, 62 op/s > 2018-12-18 21:46:59.838118 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525993: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8161 B/s rd, 6026 > kB/s wr, 543 op/s > 2018-12-18 21:47:00.840107 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525994: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 6318 kB/s > wr, 531 op/s > 2018-12-18 21:47:01.843219 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525995: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 0 B/s rd, 1047 kB/s > wr, 93 op/s > 2018-12-18 21:47:02.849617 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525996: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 272 kB/s rd, 5716 > kB/s wr, 168 op/s > 2018-12-18 21:47:03.858165 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525997: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 299 kB/s rd, 5737 > kB/s wr, 155 op/s > 2018-12-18 21:47:04.861351 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525998: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 28513 B/s rd, 6531 > kB/s wr, 558 op/s > 2018-12-18 21:47:05.868278 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109525999: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 6701 > kB/s wr, 545 op/s > 2018-12-18 21:47:06.872694 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526000: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8148 B/s rd, 989 > kB/s wr, 99 op/s > 2018-12-18 21:47:07.876537 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526001: 5120 pgs: 5120 active+clean; 16236 GB data, 
48708 GB used, 38345 GB / 87053 GB avail; 12235 B/s rd, 1400 > kB/s wr, 115 op/s > 2018-12-18 21:47:08.881359 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526002: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 16316 B/s rd, 1728 > kB/s wr, 143 op/s > 2018-12-18 21:47:09.884708 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526003: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 6121 B/s rd, 5107 > kB/s wr, 633 op/s > 2018-12-18 21:47:10.892664 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526004: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 8143 B/s rd, 5238 > kB/s wr, 627 op/s > 2018-12-18 21:47:11.896609 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526005: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 18325 B/s rd, 1464 > kB/s wr, 177 op/s > 2018-12-18 21:47:12.902995 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526006: 5120 pgs: 5120 active+clean; 16236 GB data, 48708 GB used, 38345 GB / 87053 GB avail; 36657 B/s rd, 4254 > kB/s wr, 171 op/s > 2018-12-18 21:47:12.905463 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e249 print_map > epoch 249 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:47:12.900465 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:rejoin seq 5081388 > ... 
> 2018-12-18 21:49:15.836800 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526128: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053 > GB avail; 0 B/s rd, 5510 kB/s wr, 598 op/s > 2018-12-18 21:49:16.842000 7feeb40da700 0 mon.mds1mgs1-la@0(leader).mds e250 print_map > epoch 250 > flags 0 > created 2016-01-10 23:27:39.568443 > modified 2018-12-18 21:49:16.837045 > tableserver 0 > root 0 > session_timeout 60 > session_autoclose 300 > max_file_size 1099511627776 > last_failure 246 > last_failure_osd_epoch 21859 > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in > omap,8=no anchor table} > max_mds 1 > in 0 > up {0=51598182} > failed > stopped > data_pools 12 > metadata_pool 13 > inline_data disabled > 60478022: 10.0.40.43:6802/4233 'mds1mgs1-la' mds.-1.0 up:standby seq 1 > 51598182: 10.0.40.45:6800/1216 'mds3mgs3-la' mds.0.24 up:active seq 5081420 > > 2018-12-18 21:49:16.842643 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526129: 5120 pgs: 1 active+clean+scrubbing+deep, 5119 active+clean; 16236 GB data, 48707 GB used, 38346 GB / 87053 > GB avail; 0 B/s rd, 1145 kB/s wr, 76 op/s > 2018-12-18 21:49:16.842685 7feeb40da700 0 log_channel(cluster) log [INF] : mds.0 10.0.40.45:6800/1216 up:active > 2018-12-18 21:49:16.842709 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby > ... > > 2018-12-18 21:51:36.129356 7feeb1df6700 0 mon.mds1mgs1-la@0(leader).data_health(716) update_stats avail 97% total 224 GB, used 6801 MB, avail 217 GB > 2018-12-18 21:51:41.669590 7feeb1df6700 1 mon.mds1mgs1-la@0(leader).paxos(paxos updating c 225352570..225353159) accept timeout, calling fresh election > 2018-12-18 21:51:44.947037 7feeb15f5700 1 mon.mds1mgs1-la@0(probing).data_health(716) service_dispatch not in quorum -- drop message > 2018-12-18 21:51:44.947167 7feeb15f5700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la calling new monitor election > 2018-12-18 21:51:44.947266 7feeb15f5700 1 mon.mds1mgs1-la@0(electing).elector(716) init, last seen epoch 716 > 2018-12-18 21:51:49.947965 7feeb1df6700 0 log_channel(cluster) log [INF] : mon.mds1mgs1-la@0 won leader election with quorum 0,2 > 2018-12-18 21:51:49.949429 7feeb1df6700 0 log_channel(cluster) log [INF] : HEALTH_WARN; 1 mons down, quorum 0,2 mds1mgs1-la,mds3mgs3-la > 2018-12-18 21:51:49.952545 7feeb40da700 0 log_channel(cluster) log [INF] : monmap e2: 3 mons at {mds1mgs1-la=10.0.40.43:6789/0,mds2mgs2-la=10.0.40.44:6789/0,mds3mgs3-la=10.0.40.45:6789/0} > 2018-12-18 21:51:49.952597 7feeb40da700 0 log_channel(cluster) log [INF] : pgmap v109526263: 5120 pgs: 2 active+clean+scrubbing+deep, 5118 active+clean; 16236 GB data, 48707 GB used, 38345 GB / 87053 > GB avail; 0 B/s rd, 1774 kB/s wr, 126 op/s > 2018-12-18 21:51:49.952648 7feeb40da700 0 log_channel(cluster) log [INF] : mdsmap e250: 1/1/1 up {0=mds3mgs3-la=up:active}, 1 up:standby > 2018-12-18 21:51:49.952737 7feeb40da700 0 log_channel(cluster) log [INF] : osdmap e21859: 54 osds: 54 up, 54 in > > > > ###### ceph.conf ####### > [global] > auth_service_required = cephx > filestore_xattr_use_omap = true > auth_client_required = cephx > auth_cluster_required = cephx > public_network = 10.0.40.0/23 > cluster_network = 10.0.42.0/23 > mon_host = 10.0.40.43,10.0.40.44,10.0.40.45 > mon_initial_members = mds1mgs1-la, mds2mgs2-la, 
mds3mgs3-la > fsid = 96e9619a-4828-4700-989a-fcf152286758 > ; Disabled debug 04.12.2015 > debug lockdep = 0/0 > debug context = 0/0 > debug crush = 0/0 > debug buffer = 0/0 > debug timer = 0/0 > debug journaler = 0/0 > debug osd = 0/0 > debug optracker = 0/0 > debug objclass = 0/0 > debug filestore = 0/0 > debug journal = 0/0 > debug ms = 0/0 > debug monc = 0/0 > debug tp = 0/0 > debug auth = 0/0 > debug finisher = 0/0 > debug heartbeatmap = 0/0 > debug perfcounter = 0/0 > debug asok = 0/0 > debug throttle = 0/0 > > [osd] > journal_dio = true > journal_aio = true > osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal > osd_journal_size = 2048 ; journal size, in megabytes > osd crush update on start = false > osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M" > osd_op_threads = 5 > osd_disk_threads = 4 > osd_pool_default_size = 2 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 512 > osd_pool_default_pgp_num = 512 > osd_crush_chooseleaf_type = 1 > ; osd pool_default_crush_rule = 1 > ; new options 04.12.2015 > filestore_op_threads = 4 > osd_op_num_threads_per_shard = 1 > osd_op_num_shards = 25 > filestore_fd_cache_size = 64 > filestore_fd_cache_shards = 32 > filestore_fiemap = true > ; Reduce impact of scrub (needs cfq on osds) > osd_disk_thread_ioprio_class = "idle" > osd_disk_thread_ioprio_priority = 7 > osd_deep_scrub_interval = 1211600 > osd_scrub_begin_hour = 19 > osd_scrub_end_hour = 4 > osd_scrub_sleep = 0.1 > [client] > rbd_cache = true > rbd_cache_size = 67108864 > rbd_cache_max_dirty = 50331648 > rbd_cache_target_dirty = 33554432 > rbd_cache_max_dirty_age = 2 > rbd_cache_writethrough_until_flush = true > > [mds] > mds_data = /var/lib/ceph/mds/mds.$id > keyring = /var/lib/ceph/mds/mds.$id/mds.$id.keyring > mds_cache_size = 4000000 > [mds.mds1mgs1-la] > host = mds1mgs1-la > [mds.mds2mgs2-la] > host = mds2mgs2-la > [mds.mds3mgs3-la] > host = mds3mgs3-la > > [mon.mds1mgs1-la] > host = mds1mgs1-la > mon_addr = 10.0.40.43:6789 > [mon.mds2mgs2-la] > host = mds2mgs2-la > mon_addr = 10.0.40.44:6789 > [mon.mds3mgs3-la] > host = mds3mgs3-la > mon_addr = 10.0.40.45:6789 > > ##### MDS Log mds1mgs1-la ###### > 2018-12-18 21:46:26.417766 7f8b567bc700 0 monclient: hunting for new mon > 2018-12-18 21:46:43.627767 7f8b4b45d700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/709483473 pipe(0x457dd800 sd=175 :6801 s=2 pgs=113031 cs=1 l=0 c=0x4264780).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627847 7f8b47922700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.35:0/2633350234 pipe(0x4658d800 sd=203 :6801 s=2 pgs=86438 cs=1 l=0 c=0x4265ee0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627788 7f8b4af58700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/374692697 pipe(0x13862000 sd=179 :6801 s=2 pgs=6 cs=1 l=0 c=0x4264200).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627862 7f8b49942700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.11:0/3711089666 pipe(0x45d4c800 sd=187 :6801 s=2 pgs=140 cs=1 l=0 c=0x4265d80).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627790 7f8b4b059700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/4267973111 pipe(0x13850000 sd=178 :6801 s=2 pgs=23145 cs=1 l=0 c=0x4264360).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627925 7f8b4a851700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1267883720 pipe(0x74ec800 sd=182 :6801 s=2 pgs=16223 cs=1 l=0 c=0x4264fc0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627886 7f8b4b25b700 0 -- 10.0.40.43:6801/7037 
>> 10.0.40.39:0/1116724829 pipe(0x13859000 sd=176 :6801 s=2 pgs=112792 cs=1 l=0 c=0x4264620).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627775 7f8b4b760700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1077717237 pipe(0x457d4800 sd=172 :6801 s=2 pgs=293897 cs=1 l=0 c=0x4263c80).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627806 7f8b4a54e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/1534369235 pipe(0x74fa000 sd=184 :6801 s=2 pgs=1222 cs=1 l=0 c=0x4264d00).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627987 7f8b49b44700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/2122220387 pipe(0x45d51000 sd=186 :6801 s=2 pgs=118 cs=1 l=0 c=0x4264a40).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627845 7f8b4a952700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3432607794 pipe(0x74f1000 sd=181 :6801 s=2 pgs=39 cs=1 l=0 c=0x4265120).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628013 7f8b4a750700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/609876274 pipe(0x74e8000 sd=183 :6801 s=2 pgs=336 cs=1 l=0 c=0x4264e60).fault with nothing to send, going to standby > 2018-12-18 21:46:43.627798 7f8b4b15a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2666666539 pipe(0x13854800 sd=177 :6801 s=2 pgs=22227 cs=1 l=0 c=0x42644c0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628023 7f8b4b65f700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/876706598 pipe(0x457d0000 sd=173 :6801 s=2 pgs=4 cs=1 l=0 c=0x4263b20).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628057 7f8b4b55e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.26:0/3586271069 pipe(0x457e2000 sd=174 :6801 s=2 pgs=85 cs=1 l=0 c=0x42648e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628103 7f8b4953e700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1819363804 pipe(0x45d5a000 sd=189 :6801 s=2 pgs=396 cs=1 l=0 c=0x4265ac0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628088 7f8b49740700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/3035404774 pipe(0x45d48000 sd=188 :6801 s=2 pgs=25316 cs=1 l=0 c=0x4265c20).fault with nothing to send, going to standby > 2018-12-18 21:46:43.628214 7f8b4ad56700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/3173084073 pipe(0x1385d800 sd=180 :6801 s=2 pgs=4464 cs=1 l=0 c=0x4265280).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633248 7f8b4d17a700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/2766556290 pipe(0x6e28000 sd=159 :6801 s=2 pgs=64381 cs=1 l=0 c=0x42627e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633283 7f8b4cb74700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1370204836 pipe(0x46c15800 sd=165 :6801 s=2 pgs=188 cs=1 l=0 c=0x42639c0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633319 7f8b4cf78700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/3500267372 pipe(0x46c11000 sd=161 :6801 s=2 pgs=260230 cs=1 l=0 c=0x4262940).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633442 7f8b4c66f700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/4098901061 pipe(0x2903d800 sd=170 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263f40).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633400 7f8b4933c700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/1326737308 pipe(0x45d55800 sd=190 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265960).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633423 7f8b4ca73700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/453871943 pipe(0x29039000 sd=166 :6801 s=2 pgs=5 cs=1 l=0 c=0x4263860).fault with 
nothing to send, going to standby > 2018-12-18 21:46:43.633319 7f8b4cd76700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2687385587 pipe(0x46c08000 sd=163 :6801 s=2 pgs=22601 cs=1 l=0 c=0x4263180).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633514 7f8b4c871700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/1782172062 pipe(0x29030000 sd=168 :6801 s=2 pgs=5 cs=1 l=0 c=0x42635a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633524 7f8b4ce77700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2869272101 pipe(0x46c0c800 sd=162 :6801 s=2 pgs=4 cs=1 l=0 c=0x42632e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633531 7f8b48f38700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2907939723 pipe(0x45d7a000 sd=194 :6801 s=2 pgs=80 cs=1 l=0 c=0x42653e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633539 7f8b4913a700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/3324654654 pipe(0x45d6c800 sd=192 :6801 s=2 pgs=5 cs=1 l=0 c=0x42656a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633550 7f8b4c36c700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.28:0/1001337686 pipe(0x457d9000 sd=171 :6801 s=2 pgs=214 cs=1 l=0 c=0x4263de0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633466 7f8b4c770700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.8:0/2783726776 pipe(0x29042000 sd=169 :6801 s=2 pgs=5 cs=1 l=0 c=0x42640a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633588 7f8b47720700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.28:0/804580006 pipe(0x45418000 sd=204 :6801 s=2 pgs=1211 cs=1 l=0 c=0x42677a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633444 7f8b4cc75700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/2140342456 pipe(0x46c1a000 sd=164 :6801 s=2 pgs=295850 cs=1 l=0 c=0x4263020).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633625 7f8b48a33700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.13:0/579849542 pipe(0x46589000 sd=199 :6801 s=2 pgs=113 cs=1 l=0 c=0x4266460).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633560 7f8b4c972700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.27:0/508560318 pipe(0x29034800 sd=167 :6801 s=2 pgs=3096 cs=1 l=0 c=0x4263700).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633595 7f8b4d079700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3393904046 pipe(0x6e35800 sd=160 :6801 s=2 pgs=259566 cs=1 l=0 c=0x4262aa0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633661 7f8b48c35700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.19:0/3685570052 pipe(0x45425800 sd=197 :6801 s=2 pgs=4 cs=1 l=0 c=0x4266720).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633661 7f8b48e37700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/1377539271 pipe(0x45d75800 sd=195 :6801 s=2 pgs=62322 cs=1 l=0 c=0x42669e0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633684 7f8b4a44d700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.12:0/2906542395 pipe(0x74f5800 sd=185 :6801 s=2 pgs=4250 cs=1 l=0 c=0x4264ba0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.633912 7f8b47f28700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1412785410 pipe(0x46584800 sd=200 :6801 s=2 pgs=3909776 cs=1 l=0 c=0x4266300).fault with nothing to send, going to > standby > 2018-12-18 21:46:43.634010 7f8b47e27700 0 -- 10.0.40.43:6801/7037 >> 10.0.41.33:0/1924210955 pipe(0x46580000 sd=201 :6801 s=2 pgs=703145 cs=1 l=0 c=0x42661a0).fault with nothing to send, going to standby > 2018-12-18 21:46:43.635479 7f8b47b24700 0 -- 
10.0.40.43:6801/7037 >> 10.0.41.34:0/983518683 pipe(0x46592000 sd=202 :6801 s=2 pgs=4665529 cs=1 l=0 c=0x4266040).fault with nothing to send, going to standby > 2018-12-18 21:46:44.464463 7f8b49039700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.39:0/381241131 pipe(0x45d68000 sd=193 :6801 s=2 pgs=4 cs=1 l=0 c=0x4265540).fault with nothing to send, going to standby > 2018-12-18 21:46:44.680590 7f8b4923b700 0 -- 10.0.40.43:6801/7037 >> 10.0.40.40:0/3850008269 pipe(0x45d71000 sd=191 :6801 s=2 pgs=40 cs=1 l=0 c=0x4265800).fault with nothing to send, going to standby > 2018-12-18 21:46:46.106564 7f8b567bc700 1 mds.-1.-1 handle_mds_map i (10.0.40.43:6801/7037) dne in the mdsmap, respawning myself > 2018-12-18 21:46:46.106578 7f8b567bc700 1 mds.-1.-1 respawn > 2018-12-18 21:46:46.106581 7f8b567bc700 1 mds.-1.-1 e: '/usr/bin/ceph-mds' > 2018-12-18 21:46:46.106599 7f8b567bc700 1 mds.-1.-1 0: '/usr/bin/ceph-mds' > 2018-12-18 21:46:46.106601 7f8b567bc700 1 mds.-1.-1 1: '-i' > 2018-12-18 21:46:46.106603 7f8b567bc700 1 mds.-1.-1 2: 'mds1mgs1-la' > 2018-12-18 21:46:46.106604 7f8b567bc700 1 mds.-1.-1 3: '--pid-file' > 2018-12-18 21:46:46.106605 7f8b567bc700 1 mds.-1.-1 4: '/var/run/ceph/mds.mds1mgs1-la.pid' > 2018-12-18 21:46:46.106606 7f8b567bc700 1 mds.-1.-1 5: '-c' > 2018-12-18 21:46:46.106607 7f8b567bc700 1 mds.-1.-1 6: '/etc/ceph/ceph.conf' > 2018-12-18 21:46:46.106608 7f8b567bc700 1 mds.-1.-1 7: '--cluster' > 2018-12-18 21:46:46.106609 7f8b567bc700 1 mds.-1.-1 8: 'ceph'
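
As promised above, here is a minimal sketch of the controlled-shutdown order for the standby node (mds2mgs2-la in this thread). It assumes a systemd-managed deployment, so the unit names and exact commands may differ on your release; treat it as illustrative only, not as the one true procedure.

##### controlled shutdown sketch (illustrative) #####
# 1. Stop the standby MDS on the node you are about to service.
systemctl stop ceph-mds@mds2mgs2-la    # or whatever init script your release uses

# 2. Wait for the mds map to settle: it should still show an active rank plus
#    the expected standbys, e.g. "1/1/1 up {0=mds1mgs1-la=up:active}, 1 up:standby".
ceph mds stat

# 3. Only then stop the monitor on the same node.
systemctl stop ceph-mon@mds2mgs2-la

# 4. Confirm the remaining two mons still form a quorum before powering off.
ceph quorum_status --format json-pretty
ceph -s

When the node comes back, starting the mon first and the mds afterwards keeps the same separation between the two state changes.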