Looks like your journal has some bad events in it, probably due to bugs in the multi-MDS system. Did you start out this cluster on 0.67.4, or has it been upgraded at some point? Why did you use two active MDS daemons?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Mon, Oct 21, 2013 at 7:05 PM, Gagandeep Arora <aroragagan24@xxxxxxxxx> wrote:
> Hello,
>
> We are running ceph-0.67.4 with two MDS daemons, and both of them are crashing; see the logs below:
>
> [root@ceph1 ~]# ceph health detail
> HEALTH_ERR mds rank 1 has failed; mds cluster is degraded; mds a is laggy
> mds.1 has failed
> mds cluster is degraded
> mds.a at 192.168.6.101:6808/14609 rank 0 is replaying journal
> mds.a at 192.168.6.101:6808/14609 is laggy/unresponsive
>
> [root@ceph1 ~]# ceph mds dump
> dumped mdsmap epoch 19386
> epoch 19386
> flags 0
> created 2013-03-20 08:56:13.873024
> modified 2013-10-22 11:58:31.374700
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> last_failure 19253
> last_failure_osd_epoch 6648
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
> max_mds 2
> in 0,1
> up {0=222230}
> failed 1
> stopped
> data_pools 0,13,14
> metadata_pool 1
> 222230: 192.168.6.101:6808/14609 'a' mds.0.19 up:replay seq 1 laggy since 2013-10-22 11:55:50.972032
>
> [root@ceph1 ~]# ceph-mds -i a -d
> 2013-10-22 11:55:28.093342 7f343195f7c0 0 ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7), process ceph-mds, pid 14609
> starting mds.a at :/0
> 2013-10-22 11:55:31.550871 7f342c593700 1 mds.-1.0 handle_mds_map standby
> 2013-10-22 11:55:32.151652 7f342c593700 1 mds.0.19 handle_mds_map i am now mds.0.19
> 2013-10-22 11:55:32.151658 7f342c593700 1 mds.0.19 handle_mds_map state change up:standby --> up:replay
> 2013-10-22 11:55:32.151661 7f342c593700 1 mds.0.19 replay_start
> 2013-10-22 11:55:32.151673 7f342c593700 1 mds.0.19 recovery set is 1
> 2013-10-22 11:55:32.151675 7f342c593700 1 mds.0.19 need osdmap epoch 6648, have 6647
> 2013-10-22 11:55:32.151677 7f342c593700 1 mds.0.19 waiting for osdmap 6648 (which blacklists prior instance)
> 2013-10-22 11:55:32.275413 7f342c593700 0 mds.0.cache creating system inode with ino:100
> 2013-10-22 11:55:32.275720 7f342c593700 0 mds.0.cache creating system inode with ino:1
> mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f3428078700 time 2013-10-22 11:55:37.562600
> mds/journal.cc: 1096: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))
> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
> 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x399d) [0x65b0ad]
> 2: (EUpdate::replay(MDS*)+0x3a) [0x663c0a]
> 3: (MDLog::_replay_thread()+0x5cf) [0x82e17f]
> 4: (MDLog::ReplayThread::entry()+0xd) [0x6393ad]
> 5: (()+0x7d15) [0x7f3430fc2d15]
> 6: (clone()+0x6d) [0x7f342fa3948d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> 2013-10-22 11:55:37.563382 7f3428078700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f3428078700 time 2013-10-22 11:55:37.562600
> mds/journal.cc: 1096: FAILED assert(in->first == p->dnfirst || (in->is_multiversion() && in->first > p->dnfirst))
>
> ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
> 1: (EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)+0x399d) [0x65b0ad]
> 2: (EUpdate::replay(MDS*)+0x3a) [0x663c0a]
> 3: (MDLog::_replay_thread()+0x5cf) [0x82e17f]
> 4: (MDLog::ReplayThread::entry()+0xd) [0x6393ad]
> 5: (()+0x7d15) [0x7f3430fc2d15]
> 6: (clone()+0x6d) [0x7f342fa3948d]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
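
For reference, the condition the MDS trips over at mds/journal.cc:1096 can be read as a small standalone check. Only in->first, p->dnfirst, and is_multiversion() come from the quoted assertion; the struct definitions, the snapid_t alias, and the helper name below are a simplified illustrative sketch, not actual Ceph source.

// Illustrative paraphrase of the failed assertion at mds/journal.cc:1096.
// Only in->first, p->dnfirst and is_multiversion() are taken from the
// quoted crash; the types here are stand-ins for this sketch.
#include <cstdint>
#include <iostream>

using snapid_t = uint64_t;  // stand-in for Ceph's snapshot-id type

struct Inode {
    snapid_t first;      // first snapid this version of the inode covers
    bool multiversion;
    bool is_multiversion() const { return multiversion; }
};

struct JournaledDentry {
    snapid_t dnfirst;    // first snapid recorded for the dentry in the journal event
};

// The replayed event passes the check when the inode and the journaled
// dentry start at the same snapid, or when a multiversion inode starts
// at a later (greater) snapid. The crashing MDS hit an event where
// neither case held.
bool replay_entry_is_consistent(const Inode& in, const JournaledDentry& p) {
    return in.first == p.dnfirst ||
           (in.is_multiversion() && in.first > p.dnfirst);
}

int main() {
    Inode in{5, false};
    std::cout << std::boolalpha
              << replay_entry_is_consistent(in, JournaledDentry{5}) << "\n"   // true
              << replay_entry_is_consistent(in, JournaledDentry{7}) << "\n";  // false: the real MDS asserts here
    return 0;
}

Judging by the names, the check compares the snapid range recorded for a dentry in the journal event against the inode it resolves to during replay; an event for which neither branch holds is the kind of "bad event" referred to above.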