Hi all,

My Ceph cluster ran fine for about two weeks, but the MDS crashed every 2-3 days. Now it is stuck in up:replay: it looks like replay crashes and the MDS process restarts, over and over. What can I do about this?

# ceph -s
    cluster 07df7765-c2e7-44de-9bb3-0b13f6517b18
     health HEALTH_ERR 56 pgs inconsistent; 56 scrub errors; mds cluster is degraded; noscrub,nodeep-scrub flag(s) set
     monmap e1: 2 mons at {storage-1-213=10.1.0.213:6789/0,storage-1-214=10.1.0.214:6789/0}, election epoch 26, quorum 0,1 storage-1-213,storage-1-214
     mdsmap e624: 1/1/1 up {0=storage-1-214=up:replay}, 1 up:standby
     osdmap e1932: 18 osds: 18 up, 18 in
            flags noscrub,nodeep-scrub
      pgmap v732381: 500 pgs, 3 pools, 2155 GB data, 39187 kobjects
            4479 GB used, 32292 GB / 36772 GB avail
                 444 active+clean
                  56 active+clean+inconsistent
  client io 125 MB/s rd, 31 op/s

MDS log here:

2014-09-18 12:36:10.684841 7f8240512700 5 mds.-1.-1 handle_mds_map epoch 620 from mon.0
2014-09-18 12:36:10.684888 7f8240512700 1 mds.-1.0 handle_mds_map standby
2014-09-18 12:38:55.584370 7f8240512700 5 mds.-1.0 handle_mds_map epoch 621 from mon.0
2014-09-18 12:38:55.584432 7f8240512700 1 mds.0.272 handle_mds_map i am now mds.0.272
2014-09-18 12:38:55.584436 7f8240512700 1 mds.0.272 handle_mds_map state change up:standby --> up:replay
2014-09-18 12:38:55.584440 7f8240512700 1 mds.0.272 replay_start
2014-09-18 12:38:55.584456 7f8240512700 7 mds.0.cache set_recovery_set
2014-09-18 12:38:55.584460 7f8240512700 1 mds.0.272 recovery set is
2014-09-18 12:38:55.584464 7f8240512700 1 mds.0.272 need osdmap epoch 1929, have 1927
2014-09-18 12:38:55.584467 7f8240512700 1 mds.0.272 waiting for osdmap 1929 (which blacklists prior instance)
2014-09-18 12:38:55.584523 7f8240512700 5 mds.0.272 handle_mds_failure for myself; not doing anything
2014-09-18 12:38:55.585662 7f8240512700 2 mds.0.272 boot_start 0: opening inotable
2014-09-18 12:38:55.585864 7f8240512700 2 mds.0.272 boot_start 0: opening sessionmap
2014-09-18 12:38:55.586003 7f8240512700 2 mds.0.272 boot_start 0: opening mds log
2014-09-18 12:38:55.586049 7f8240512700 5 mds.0.log open discovering log bounds
2014-09-18 12:38:55.586136 7f8240512700 2 mds.0.272 boot_start 0: opening snap table
2014-09-18 12:38:55.586984 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6806/6114
2014-09-18 12:38:55.587037 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6811/6385
2014-09-18 12:38:55.587285 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.213:6801/6110
2014-09-18 12:38:55.591700 7f823ca08700 4 mds.0.log Waiting for journal 200 to recover...
2014-09-18 12:38:55.593297 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6806/6238
2014-09-18 12:38:55.600952 7f823ca08700 4 mds.0.log Journal 200 recovered.
2014-09-18 12:38:55.600967 7f823ca08700 4 mds.0.log Recovered journal 200 in format 1
2014-09-18 12:38:55.600973 7f823ca08700 2 mds.0.272 boot_start 1: loading/discovering base inodes
2014-09-18 12:38:55.600979 7f823ca08700 0 mds.0.cache creating system inode with ino:100
2014-09-18 12:38:55.601279 7f823ca08700 0 mds.0.cache creating system inode with ino:1
2014-09-18 12:38:55.602557 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6811/6276
2014-09-18 12:38:55.607234 7f8240512700 2 mds.0.272 boot_start 2: replaying mds log
2014-09-18 12:38:55.675025 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 -> -2,-2 on [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da0000]
2014-09-18 12:38:55.675055 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da0000]
2014-09-18 12:38:55.675065 7f823ca08700 7 mds.0.cache adjust_subtree_auth -1,-2 -> -2,-2 on [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 0x5da03b8]
2014-09-18 12:38:55.675076 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=0 cv=0/0 state=1073741824 f() n() hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
2014-09-18 12:38:55.675087 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bound_dfs []
2014-09-18 12:38:55.675116 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000] bounds
2014-09-18 12:38:55.675129 7f823ca08700 7 mds.0.cache current root is [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=-2 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
2014-09-18 12:38:55.675142 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8] bound_dfs []
2014-09-18 12:38:55.675152 7f823ca08700 7 mds.0.cache adjust_bounded_subtree_auth -2,-2 -> 0,-2 on [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8] bounds
2014-09-18 12:38:55.675160 7f823ca08700 7 mds.0.cache current root is [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=-2 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
2014-09-18 12:38:55.675167 7f823ca08700 7 mds.0.cache recalc_auth_bits (replay)
2014-09-18 12:38:55.675171 7f823ca08700 7 mds.0.cache show_cache
2014-09-18 12:38:55.675172 7f823ca08700 7 mds.0.cache unlinked [inode 100 [...2,head] ~mds0/ auth v85546 snaprealm=0x3bf1d40 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 83=72+11)/n(v0 12=1+11) (iversion lock) | dirfrag=1 0x3d68000]
2014-09-18 12:38:55.675187 7f823ca08700 7 mds.0.cache dirfrag [dir 100 ~mds0/ [2,head] auth v=1288520 cv=0/0 dir_auth=0 state=1073741824 f(v0 11=1+10) n(v16575 rc2014-09-15 18:01:15.766451 b2512216 82=72+10) hs=0+0,ss=0+0 | subtree=1 0x5da03b8]
2014-09-18 12:38:55.675207 7f823ca08700 7 mds.0.cache unlinked [inode 1 [...2,head] / auth v1 snaprealm=0x3bf21c0 f() n(v0 1=0+1) (iversion lock) | dirfrag=1 0x3d68938]
2014-09-18 12:38:55.675213 7f823ca08700 7 mds.0.cache dirfrag [dir 1 / [2,head] auth v=1076158 cv=0/0 dir_auth=0 state=1073741824 f(v0 m2014-09-09 17:49:20.000000 1=0+1) n(v87567 rc2014-09-16 12:44:41.750069 b1824476527135 31747410=31708953+38457)/n(v87567 rc2014-09-16 12:44:38.450226 b1824464654503 31746894=31708437+38457) hs=0+0,ss=0+0 | subtree=1 0x5da0000]
2014-09-18 12:38:56.089244 7f8240512700 5 mds.0.272 ms_handle_connect on 10.1.0.214:6801/6078
2014-09-18 12:39:08.795245 7f8240512700 5 mds.0.272 handle_mds_map epoch 622 from mon.0
2014-09-18 12:49:15.677419 7f823dc0c700 5 mds.0.272 is_laggy 29.870013 > 15 since last acked beacon
2014-09-18 12:49:15.680801 7f823dc0c700 5 mds.0.272 tick bailing out since we seem laggy
2014-09-18 12:49:15.684033 7f8240512700 5 mds.0.272 handle_mds_map epoch 623 from mon.0
2014-09-18 12:49:15.684406 7f8240512700 1 mds.-1.-1 handle_mds_map i (10.1.0.213:6822/17501) dne in the mdsmap, respawning myself
2014-09-18 12:49:15.684429 7f8240512700 1 mds.-1.-1 respawn
2014-09-18 12:49:15.684432 7f8240512700 1 mds.-1.-1  e: 'ceph-mds'
2014-09-18 12:49:15.684444 7f8240512700 1 mds.-1.-1  0: 'ceph-mds'
2014-09-18 12:49:15.684447 7f8240512700 1 mds.-1.-1  1: '-i'
2014-09-18 12:49:15.684449 7f8240512700 1 mds.-1.-1  2: 'storage-1-213'
2014-09-18 12:49:15.684451 7f8240512700 1 mds.-1.-1  3: '-f'
2014-09-18 12:49:16.475451 7f32533f77a0 0 ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187), process ceph-mds, pid 17501
2014-09-18 12:49:16.487916 7f324e4f1700 5 mds.-1.0 ms_handle_connect on 10.1.0.213:6789/0
2014-09-18 12:49:16.517705 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 623 from mon.0
2014-09-18 12:49:16.900142 7f324e4f1700 5 mds.-1.-1 handle_mds_map epoch 624 from mon.0
2014-09-18 12:49:16.900190 7f324e4f1700 1 mds.-1.0 handle_mds_map standby
2014-09-18 12:52:05.654849 7f324e4f1700 5 mds.-1.0 handle_mds_map epoch 625 from mon.0
2014-09-18 12:52:05.654929 7f324e4f1700 1 mds.0.274 handle_mds_map i am now mds.0.274
2014-09-18 12:52:05.654933 7f324e4f1700 1 mds.0.274 handle_mds_map state change up:standby --> up:replay
2014-09-18 12:52:05.654937 7f324e4f1700 1 mds.0.274 replay_start
2014-09-18 12:52:05.654954 7f324e4f1700 7 mds.0.cache set_recovery_set
2014-09-18 12:52:05.654960 7f324e4f1700 1 mds.0.274 recovery set is
2014-09-18 12:52:05.654962 7f324e4f1700 1 mds.0.274 need osdmap epoch 1933, have 1931
2014-09-18 12:52:05.654965 7f324e4f1700 1 mds.0.274 waiting for osdmap 1933 (which blacklists prior instance)
2014-09-18 12:52:05.655022 7f324e4f1700 5 mds.0.274 handle_mds_failure for myself; not doing anything
2014-09-18 12:52:05.668471 7f324e4f1700 2 mds.0.274 boot_start 0: opening inotable
2014-09-18 12:52:05.668731 7f324e4f1700 2 mds.0.274 boot_start 0: opening sessionmap
2014-09-18 12:52:05.668878 7f324e4f1700 2 mds.0.274 boot_start 0: opening mds log
2014-09-18 12:52:05.668883 7f324e4f1700 5 mds.0.log open discovering log bounds
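If I read the log right, the MDS never crashes outright: its beacon to the monitor is not acked within the 15-second grace (is_laggy 29.870013 > 15), the monitor drops it from the mdsmap ("dne in the mdsmap, respawning myself"), and it respawns before replay can finish, so the cycle repeats. Would raising the beacon grace give replay enough time to complete? A sketch of what I am considering, assuming mds_beacon_grace (default 15 s) is the right knob and that 60 s is enough; I have not tested this:

# in ceph.conf (where both the mons and the mds read it), then restart:
[global]
    mds beacon grace = 60

# or injected at runtime without a restart:
ceph tell mon.* injectargs '--mds_beacon_grace 60'

Is that the right approach, or is there a better way to let replay finish?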