ceph 0.78 mon and mds crashing (bus error)

----- Message from Joao Eduardo Luis <joao.luis at inktank.com> ---------
    Date: Tue, 01 Apr 2014 13:33:05 +0100
    From: Joao Eduardo Luis <joao.luis at inktank.com>
Subject: Re: ceph 0.78 mon and mds crashing (bus error)
      To: Kenneth Waegeman <Kenneth.Waegeman at UGent.be>, ceph-users  
<ceph-users at lists.ceph.com>


> On 04/01/2014 01:17 PM, Kenneth Waegeman wrote:
>> Hi all,
>>
>> We have installed Ceph 0.78 on 3 hosts running SL6, each serving OSD and
>> MON daemons, 2 of them also running MDS in active/backup. Since today the
>> MON and MDS daemons crash a short while after every start. Rebooting
>> the nodes didn't help. The cluster had been running for 1-2 weeks without
>> this problem. This is what's in the log files:
>>
>
> Something is going wrong in leveldb.  What leveldb version are you using?

This is leveldb-1.7.0-2.el6.x86_64
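
One thing that may be worth ruling out (an assumption on my side, not
something the log confirms): a bus error raised from memcpy() underneath
leveldb::log::Writer can happen when a write goes through a memory-mapped
file and the filesystem cannot allocate the backing blocks, for example
when the disk holding the mon store is full. Below is a minimal sketch
that reports the free space on that filesystem. The path
/var/lib/ceph/mon/ceph-ceph001 and the 5% threshold are assumptions made
for illustration; substitute the actual "mon data" location.

```python
import os


def free_space_report(path, min_free_fraction=0.05):
    """Return (free_bytes, free_fraction, ok) for the filesystem holding path.

    free_bytes counts only blocks available to unprivileged users.
    min_free_fraction is an arbitrary illustrative threshold.
    """
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    fraction = float(free) / total if total else 0.0
    return free, fraction, fraction >= min_free_fraction


if __name__ == "__main__":
    # Assumed mon data directory; fall back to "/" if it does not exist here.
    mon_data = "/var/lib/ceph/mon/ceph-ceph001"
    path = mon_data if os.path.exists(mon_data) else "/"
    free, fraction, ok = free_space_report(path)
    status = "ok" if ok else "LOW"
    print("%s: %d bytes free (%.1f%%) -> %s" % (path, free, fraction * 100, status))
```

If this reports plenty of free space, the disk-full explanation can be
set aside and the leveldb version itself becomes the more likely suspect.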
>
>   -Joao
>
>>
>>
>> 2014-04-01 12:48:09.442819 7fa39cb1d700  0 log [INF] : mdsmap e67: 1/1/1
>> up {0=ceph001.cubone.os=up:active(laggy or crashed)}
>> 2014-04-01 12:48:09.442897 7fa39cb1d700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 504009..504758) is_readable
>> now=2014-04-01 12:48:09.442898 lease_expire=2014-04-01 12:48:14.442854
>> has v0 lc 504758
>> 2014-04-01 12:48:15.340419 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos updating c 504009..504758) is_readable
>> now=2014-04-01 12:48:15.340420 lease_expire=2014-04-01 12:48:14.442854
>> has v0 lc 504758
>> 2014-04-01 12:48:15.340438 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos updating c 504009..504758) is_readable
>> now=2014-04-01 12:48:15.340439 lease_expire=2014-04-01 12:48:14.442854
>> has v0 lc 504758
>> 2014-04-01 12:48:15.340443 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos updating c 504009..504758) is_readable
>> now=2014-04-01 12:48:15.340444 lease_expire=2014-04-01 12:48:14.442854
>> has v0 lc 504758
>> 2014-04-01 12:48:22.260674 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 504009..504759) is_readable
>> now=2014-04-01 12:48:22.260676 lease_expire=2014-04-01 12:48:27.260625
>> has v0 lc 504759
>> 2014-04-01 12:48:22.260688 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 504009..504759) is_readable
>> now=2014-04-01 12:48:22.260689 lease_expire=2014-04-01 12:48:27.260625
>> has v0 lc 504759
>> 2014-04-01 12:48:22.260703 7fa39d51e700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 504009..504759) is_readable
>> now=2014-04-01 12:48:22.260704 lease_expire=2014-04-01 12:48:27.260625
>> has v0 lc 504759
>> 2014-04-01 12:48:22.292363 7fa39cb1d700  0 log [INF] : mon.ceph001
>> calling new monitor election
>> 2014-04-01 12:48:22.292428 7fa39cb1d700  1
>> mon.ceph001 at 0(electing).elector(43) init, la2014-04-01 12:48:27.450606
>> 7fa39d51e700 -1 *** Caught signal (Bus error) **
>>  in thread 7fa39d51e700
>>
>>  ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
>>  1: /usr/bin/ceph-mon() [0x85e619]
>>  2: /lib64/libpthread.so.0() [0x34c360f710]
>>  3: (memcpy()+0x15b) [0x34c3289aab]
>>  4: (()+0x3d4b2) [0x7fa3a0ae54b2]
>>  5: (leveldb::log::Writer::EmitPhysicalRecord(leveldb::log::RecordType,
>> char const*, unsigned long)+0x125) [0x7fa3a0acbe85]
>>  6: (leveldb::log::Writer::AddRecord(leveldb::Slice const&)+0xe2)
>> [0x7fa3a0acc092]
>>  7: (leveldb::DBImpl::Write(leveldb::WriteOptions const&,
>> leveldb::WriteBatch*)+0x38e) [0x7fa3a0ac0bde]
>>  8:
>> (LevelDBStore::submit_transaction_sync(std::tr1::shared_ptr<KeyValueDB::TransactionImpl>)+0x2b)
>> [0x7c63eb]
>>  9: (MonitorDBStore::apply_transaction(MonitorDBStore::Transaction
>> const&)+0x183) [0x53c6b3]
>>  10: (Paxos::begin(ceph::buffer::list&)+0x604) [0x5a2ca4]
>>  11: (Paxos::propose_queued()+0x273) [0x5a3773]
>>  12: (Paxos::propose_new_value(ceph::buffer::list&, Context*)+0x160)
>> [0x5a3960]
>>  13: (PaxosService::propose_pending()+0x386) [0x5ad946]
>>  14: (Context::complete(int)+0x9) [0x580609]
>>  15: (SafeTimer::timer_thread()+0x453) [0x768a33]
>>  16: (SafeTimerThread::entry()+0xd) [0x76abed]
>>  17: /lib64/libpthread.so.0() [0x34c36079d1]
>>  18: (clone()+0x6d) [0x34c32e8b6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> --- begin dump of recent events ---
>> -10000> 2014-04-01 12:37:41.891458 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 --> mon.2 10.141.8.182:6789/0 -- paxos(lease lc
>> 504309 fc 503758 pn 0 opn 0) v3 -- ?+0 0x6246e00
>>  -9999> 2014-04-01 12:37:41.891476 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 <== mon.0 10.141.8.180:6789/0 0 ==== log(1 entries)
>> v1 ==== 0+0+0 (0 0 0) 0x29f4380 con 0x2050840
>>  -9998> 2014-04-01 12:37:41.891487 7fa39cb1d700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 503758..504309) is_readable
>> now=2014-04-01 12:37:41.891488 lease_expire=2014-04-01 12:37:46.891457
>> has v0 lc 504309
>>  -9997> 2014-04-01 12:37:41.893979 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 <== mon.2 10.141.8.182:6789/0 672234191 ====
>> paxos(lease_ack lc 504309 fc 503758 pn 0 opn 0) v3 ==== 80+0+0
>> (2650854818 0 0) 0x6246e00 con 0x2051760
>>  -9996> 2014-04-01 12:37:42.087358 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 <== mon.2 10.141.8.182:6789/0 672234192 ====
>> forward(pg_stats(1 pgs tid 13980 v 0) v1 caps allow profile osd tid
>> 57315 con_features 4398046511103) to leader v2 ==== 916+0+0 (3952263404
>> 0 0) 0x6247580 con 0x2051760
>>  -9995> 2014-04-01 12:37:42.087395 7fa39cb1d700  1
>> mon.ceph001 at 0(leader).paxos(paxos active c 503758..504309) is_readable
>> now=2014-04-01 12:37:42.087396 lease_expire=2014-04-01 12:37:46.891457
>> has v0 lc 504309
>>  -9994> 2014-04-01 12:37:42.890767 7fa39d51e700  5
>> mon.ceph001 at 0(leader).paxos(paxos active c 503758..504309)
>> queue_proposal bl 16145 bytes; ctx = 0x1fa6e90
>>  -9993> 2014-04-01 12:37:42.891438 7fa39d51e700  1 --
>> 10.141.8.180:6789/0 --> mon.2 10.141.8.182:6789/0 -- paxos(begin lc
>> 504309 fc 0 pn 1800 opn 0) v3 -- ?+0 0x369d000
>>  -9992> 2014-04-01 12:37:42.891520 7fa39d51e700  5
>> mon.ceph001 at 0(leader).paxos(paxos updating c 503758..504309)
>> queue_proposal bl 1025 bytes; ctx = 0x1fa0af0
>>  -9991> 2014-04-01 12:37:42.891527 7fa39d51e700  5
>> mon.ceph001 at 0(leader).paxos(paxos updating c 503758..504309)
>> propose_new_value not active; proposal queued
>>  -9990> 2014-04-01 12:37:42.892656 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 <== mon.2 10.141.8.182:6789/0 672234193 ====
>> paxos(accept lc 504309 fc 0 pn 1800 opn 0) v3 ==== 80+0+0 (2064056753 0
>> 0) 0x369d000 con 0x2051760
>>  -9989> 2014-04-01 12:37:42.893660 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 --> mon.2 10.141.8.182:6789/0 -- paxos(commit lc
>> 504310 fc 0 pn 1800 opn 0) v3 -- ?+0 0x6247580
>>  -9988> 2014-04-01 12:37:42.894019 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 --> 10.141.8.180:6789/0 -- log(last 1348) v1 -- ?+0
>> 0x41ddcc0 con 0x2050840
>>  -9987> 2014-04-01 12:37:42.894038 7fa39cb1d700  1 --
>> 10.141.8.180:6789/0 --> mon.2 10.141.8.182:6789/0 -- paxos(lease lc
>> 504310 fc 503758 pn 0 opn 0) v3 -- ?+0 0x6246e00
>>
>> 2 of the 3 MONs are down and the MDSs are down, but the OSDs are still
>> running.
>> Does anyone know what is happening here?
>>
>>
>> Thanks!
>>
>> Kind regards,
>> Kenneth
>>
>>
>
>
> -- 
> Joao Eduardo Luis
> Software Engineer | http://inktank.com | http://ceph.com


----- End message from Joao Eduardo Luis <joao.luis at inktank.com> -----

-- 

Kind regards,
Kenneth Waegeman


