On Thu, Sep 18, 2014 at 5:35 PM, ??? <Derek at f-club.cn> wrote:
> if I turn on debug=20, the log will be more than 100 GB;
>
> there seems to be no way to upload it. Do you have any other good way
> to figure it out?

It should compress well, and you can use ceph-post-file if you don't
have a place to host it yourself. (There's a sketch of that workflow at
the end of this message.)
-Greg

>
> Would you like to log into the server to check?
>
>
> From: Gregory Farnum
> Date: 2014-09-19 02:33
> To: ???
> CC: ceph-users
> Subject: Re: ceph mds unable to start with 0.85
>
> On Wed, Sep 17, 2014 at 9:59 PM, ??? <Derek at f-club.cn> wrote:
>> dear,
>> my ceph cluster worked for about two weeks; the mds crashed every 2-3 days.
>> Now it is stuck in replay; it looks like replay crashes and the mds process restarts again.
>> What can I do about this?
>>
>> 1015 => # ceph -s
>>     cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
>>      health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set
>>      monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214
>>      mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
>>      osdmap e1932: 18 osds: 18 up, 18 in
>>             flags noscrub,nodeep-scrub
>>      pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
>>            4479 GB used, 32292 GB / 36772 GB avail
>>                 444 active+clean
>>                  56 active+clean+inconsistent
>>   client io 125 MB/s rd, 31 op/s
>>
>> MDS log here:
>>
>> 2014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0
>> 2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
>> 2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0
>> 2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272
>> 2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby --> up:replay
>> 2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
>> 2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
>> 2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
>> 2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927
>> 2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance)
>> 2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything
>> 2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable
>> 2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap
>> 2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log
>> 2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds
>> 2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table
>> 2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114
>> 2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385
>> 2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110
>> 2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover...
>> 2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238
>> 2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
>> 2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1
>> 2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes
>> 2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100
>> 2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1
>> 2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276
>> 2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log
>> 2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 -> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da0000]
>> 2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da0000]
>> 2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 -> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8]
>> 2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
>> 2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bound_dfs []
>> 2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bounds
>> 2014-09-18 12:38:55.675129 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
>> 2014-09-18 12:38:55.675142 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8] bound_dfs []
>> 2014-09-18 12:38:55.675152 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8] bounds
>> 2014-09-18 12:38:55.675160 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
>> 2014-09-18 12:38:55.675167 7f823ca08700 7 mds.0.cache recalc_auth_bits (replay)
>> 2014-09-18 12:38:55.675171 7f823ca08700 7 mds.0.cache show_cache
>> 2014-09-18 12:38:55.675172 7f823ca08700 7 mds.0.cache unlinked [inode 100 [...2,head] ~mds0/ auth v85546 snaprealm=0x3bf1d40 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 83=72+11)/n(v0 12=1+11) (iversion lock) | dirfrag=1 0x3d68000]
>> 2014-09-18 12:38:55.675187 7f823ca08700 7 mds.0.cache dirfrag [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=0 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
>> 2014-09-18 12:38:55.675207 7f823ca08700 7 mds.0.cache unlinked [inode 1 [...2,head] / auth v1 snaprealm=0x3bf21c0 f() n(v0 1=0+1) (iversion lock) | dirfrag=1 0x3d68938]
>> 2014-09-18 12:38:55.675213 7f823ca08700 7 mds.0.cache dirfrag [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=0 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
>> 2014-09-18 12:38:56.089244 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6801/6078
>> 2014-09-18 12:39:08.795245 7f8240512700 5 mds.0.272 handle_mds_map epoch 622 from mon.0
>> 2014-09-18 12:49:15.677419 7f823dc0c700 5 mds.0.272 is_laggy 29.870013 > 15 since last acked beacon
>> 2014-09-18 12:49:15.680801 7f823dc0c700 5 mds.0.272 tick bailing out since we seem laggy
>> 2014-09-18 12:49:15.684033 7f8240512700 5 mds.0.272 handle_mds_map epoch 623 from mon.0
>> 2014-09-18 12:49:15.684406 7f8240512700 1 mds.-1.-1 handle_mds_map i (10.1.0.213:6822/17501) dne in the mdsmap, respawning myself
>> 2014-09-18 12:49:15.684429 7f8240512700 1 mds.-1.-1 respawn
>> 2014-09-18 12:49:15.684432 7f8240512700 1 mds.-1.-1 e: 'ceph-mds'
>> 2014-09-18 12:49:15.684444 7f8240512700 1 mds.-1.-1 0: 'ceph-mds'
>> 2014-09-18 12:49:15.684447 7f8240512700 1 mds.-1.-1 1: '-i'
>> 2014-09-18 12:49:15.684449 7f8240512700 1 mds.-1.-1 2: 'storage-1-213'
>> 2014-09-18 12:49:15.684451 7f8240512700 1 mds.-1.-1 3: '-f'
>> 2014-09-18 12:49:16.475451 7f32533f77a0 0 ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187), process ceph-mds, pid 17501
>> 2014-09-18 12:49:16.487916 7f324e4f1700 5 mds.-1.0 ms_handle_connect on 10.1.0.213:6789/0
>> 2014-09-18 12:49:16.517705 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 623 from mon.0
>> 2014-09-18 12:49:16.900142 7f324e4f1700 5 mds.-1.-1 handle_mds_map epoch 624 from mon.0
>> 2014-09-18 12:49:16.900190 7f324e4f1700 1 mds.-1.0 handle_mds_map standby
>> 2014-09-18 12:52:05.654849 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 625 from mon.0
>> 2014-09-18 12:52:05.654929 7f324e4f1700 1 mds.0.274 handle_mds_map i am now mds.0.274
>> 2014-09-18 12:52:05.654933 7f324e4f1700 1 mds.0.274 handle_mds_map state change up:standby --> up:replay
>> 2014-09-18 12:52:05.654937 7f324e4f1700 1 mds.0.274 replay_start
>> 2014-09-18 12:52:05.654954 7f324e4f1700 7 mds.0.cache set_recovery_set
>> 2014-09-18 12:52:05.654960 7f324e4f1700 1 mds.0.274 recovery set is
>> 2014-09-18 12:52:05.654962 7f324e4f1700 1 mds.0.274 need osdmap epoch 1933, have 1931
>> 2014-09-18 12:52:05.654965 7f324e4f1700 1 mds.0.274 waiting for osdmap 1933 (which blacklists prior instance)
>> 2014-09-18 12:52:05.655022 7f324e4f1700 5 mds.0.274 handle_mds_failure for myself; not doing anything
>> 2014-09-18 12:52:05.668471 7f324e4f1700 2 mds.0.274 boot_start 0: opening inotable
>> 2014-09-18 12:52:05.668731 7f324e4f1700 2 mds.0.274 boot_start 0: opening sessionmap
>> 2014-09-18 12:52:05.668878 7f324e4f1700 2 mds.0.274 boot_start 0: opening mds log
>> 2014-09-18 12:52:05.668883 7f324e4f1700 5 mds.0.log open discovering log bounds
>
> You would need to turn up logging quite a bit (debug mds = 20, debug
> objecter = 10, debug journaler = 20 should be good) and put it
> somewhere accessible. From that log, all anybody can say is that your
> MDS is respawning on purpose, because the monitor didn't get a
> liveness report from it for such a long time.
>
> That said, version .85 is a dev release, and we have some fixes in our
> current Giant stabilization branch which might be in .85 as well, but
> they won't be backported. It's generally best to run named releases
> unless you are participating in the development and testing of new
> features. :)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
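
To make the logging suggestion concrete: one way to apply the settings
Greg lists is through ceph.conf on the MDS hosts. This is a minimal
sketch, not a prescribed procedure; it assumes the default config path,
the daemon names from the mdsmap above, and the sysvinit packaging of
this era (adjust the restart command to your init system):

    # /etc/ceph/ceph.conf on the MDS host (e.g. storage-1-213)
    [mds]
        debug mds = 20
        debug objecter = 10
        debug journaler = 20

then restart the daemon so the next replay attempt is logged at full
verbosity:

    service ceph restart mds.storage-1-213

By default the output lands in /var/log/ceph/ceph-mds.<id>.log on that
host.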
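
And the compress-and-upload step suggested at the top of the thread
might look like this; the log path assumes the default location, and
the -d description text is just an example:

    # after reproducing the crash, compress the debug log
    # (plain-text MDS logs compress very well)
    gzip /var/log/ceph/ceph-mds.storage-1-213.log

    # upload it to a drop point the Ceph developers can read
    ceph-post-file -d 'mds stuck in replay on 0.85' \
        /var/log/ceph/ceph-mds.storage-1-213.log.gz

ceph-post-file prints an identifier for the upload when it finishes;
quote that identifier in your reply to the list so the log can be
located.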