Hi Josh,

I quoted the trace and some other stats in my first email; maybe it got stuck in the spam filters. Well, next try:

snip
    -3> 2012-05-10 14:52:29.509940 7fb1c9351700  1 mds.0.40 handle_mds_map i am now mds.0.40
    -2> 2012-05-10 14:52:29.509956 7fb1c9351700  1 mds.0.40 handle_mds_map state change up:reconnect --> up:rejoin
    -1> 2012-05-10 14:52:29.509963 7fb1c9351700  1 mds.0.40 rejoin_joint_start
     0> 2012-05-10 14:52:29.512503 7fb1c9351700 -1 *** Caught signal (Segmentation fault) **
 in thread 7fb1c9351700

 ceph version 0.46 (commit:cb7f1c9c7520848b0899b26440ac34a8acea58d1)
 1: ceph-mds() [0x814279]
 2: (()+0xeff0) [0x7fb1cddbfff0]
 3: (SnapRealm::have_past_parents_open(snapid_t, snapid_t)+0x4f) [0x6cb5ef]
 4: (MDCache::check_realm_past_parents(SnapRealm*)+0x2b) [0x55d58b]
 5: (MDCache::choose_lock_states_and_reconnect_caps()+0x29c) [0x572eec]
 6: (MDCache::rejoin_gather_finish()+0x90) [0x5931a0]
 7: (MDCache::rejoin_send_rejoins()+0x2c05) [0x59b9d5]
 8: (MDS::rejoin_joint_start()+0x131) [0x4a8721]
 9: (MDS::handle_mds_map(MMDSMap*)+0x2c4a) [0x4c253a]
 10: (MDS::handle_core_message(Message*)+0x913) [0x4c4513]
 11: (MDS::_dispatch(Message*)+0x2f) [0x4c45ef]
 12: (MDS::ms_dispatch(Message*)+0x1fb) [0x4c628b]
 13: (SimpleMessenger::dispatch_entry()+0x979) [0x7acb49]
 14: (SimpleMessenger::DispatchThread::entry()+0xd) [0x7336ed]
 15: (()+0x68ca) [0x7fb1cddb78ca]
 16: (clone()+0x6d) [0x7fb1cc63f92d]
snip

I thought Ceph chooses which MDS is active and which is standby; I just have the three of them in the cluster config:

[mds.a]
        host = x

[mds.b]
        host = y

[mds.c]
        host = z

There is no global MDS section. Should I reconfigure this? (I put a guess at what you might mean below the quoted text at the end of this mail.)

2012/5/17 Josh Durgin <josh.durgin@xxxxxxxxxxx>:
> On 05/16/2012 01:11 AM, Felix Feinhals wrote:
>>
>> Hi again,
>>
>> Anything on this problem? It seems that the only choice for me is to
>> reinitialize the whole cephfs (mkcephfs...)
>> :(
>
> Hi Felix, it looks like your first mail never reached the list.
>
>> 2012/5/10 Felix Feinhals <ff@xxxxxxxxxxxxxxxxxxxxxxx>:
>>>
>>> Hi List,
>>>
>>> we installed a Ceph cluster with ceph version 0.46:
>>> 3 OSDs, 3 MONs and 3 MDSs.
>>>
>>> After copying a bunch of files to a ceph-fuse mount, all MDS daemons
>>> crash, and now I can't bring them back online.
>>> I already tried restarting the daemons in a different order and also
>>> removed one OSD; nothing really happened, except that we now have PGs in
>>> active+remapped, which I think is normal.
>>> Any hints?
>
> Are all three MDS active? At this point, more than one active MDS is
> likely to crash. You can have one active and others standby.
>
> If you've got only one active, what was the backtrace of the crash?
> It'll be at the end of the MDS log (by default in /var/log/ceph).
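
P.S. To make my question concrete, here is roughly what I understand "one active plus standbys" to mean. This is only my guess from the docs; the standby options and the set_max_mds command below are assumptions on my part, so please correct me if they don't apply to 0.46:

[mds.a]
        host = x
        ; meant to be the only active MDS (rank 0)

[mds.b]
        host = y
        ; hot standby following rank 0 (assuming these options exist in 0.46)
        mds standby for rank = 0
        mds standby replay = true

[mds.c]
        host = z
        ; plain standby for rank 0
        mds standby for rank = 0

; and keep a single active rank, assuming the monitor command takes this form:
;   ceph mds set_max_mds 1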