Dear Zheng Yan,

I will send you the log if the error occurs again. I am running 3 MDSs, with 1 active and 2 standby. How can I back up and restore the metadata?

On Wed, Aug 27, 2014 at 3:09 PM, Yan, Zheng <ukernel at gmail.com> wrote:

> Please first delete the old mds log, then run the mds with "debug_mds = 15".
> Send the whole mds log to us after the mds crashes.
>
> Yan, Zheng
>
>
> On Wed, Aug 27, 2014 at 12:12 PM, MinhTien MinhTien
> <tientienminh080590 at gmail.com> wrote:
>
>> Hi Gregory Farnum,
>>
>> Thank you for your reply!
>> This is the log:
>>
>> 2014-08-26 16:22:39.103461 7f083752f700 -1 mds/CDir.cc: In function 'void
>> CDir::_committed(version_t)' thread 7f083752f700 time 2014-08-26 16:22:39.075809
>> mds/CDir.cc: 2071: FAILED assert(in->is_dirty() || in->last < ((__u64)(-2)))
>>
>> ceph version 0.67.10 (9d446bd416c52cd785ccf048ca67737ceafcdd7f)
>> 1: (CDir::_committed(unsigned long)+0xc4e) [0x74d9ee]
>> 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe8d) [0x7d09bd]
>> 3: (MDS::handle_core_message(Message*)+0x987) [0x57c457]
>> 4: (MDS::_dispatch(Message*)+0x2f) [0x57c50f]
>> 5: (MDS::ms_dispatch(Message*)+0x19b) [0x57dfbb]
>> 6: (DispatchQueue::entry()+0x5a2) [0x904732]
>> 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x8afdbd]
>> 8: (()+0x79d1) [0x7f083c2979d1]
>> 9: (clone()+0x6d) [0x7f083afb6b5d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/ 5 hadoop
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent 10000
>>   max_new 1000
>>   log_file /var/log/ceph/ceph-mds.Ceph01-dc5k3u0104.log
>> --- end dump of recent events ---
>> 2014-08-26 16:22:39.134173 7f083752f700 -1 *** Caught signal (Aborted) **
>>  in thread 7f083752f700
>>
>>
>> On Wed, Aug 27, 2014 at 3:09 AM, Gregory Farnum <greg at inktank.com> wrote:
>>
>>> I don't think the log messages you're showing are the actual cause of
>>> the failure. The log file should have a proper stack trace (with
>>> specific function references and probably a listed assert failure);
>>> can you find that?
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Tue, Aug 26, 2014 at 9:11 AM, MinhTien MinhTien
>>> <tientienminh080590 at gmail.com> wrote:
>>> > Hi all,
>>> >
>>> > I have a cluster of 2 nodes on CentOS 6.5 with Ceph 0.67.10 (replica size 2).
>>> >
>>> > When I added the 3rd node to the Ceph cluster, Ceph started rebalancing data.
>>> >
>>> > I have 3 MDSs on the 3 nodes; the MDS process dies after a while with
>>> > this stack trace:
>>> >
>>> > ---------------------------------------------------------------------
>>> >
>>> > 2014-08-26 17:08:34.362901 7f1c2c704700  1 -- 10.20.0.21:6800/22154 <==
>>> > osd.10 10.20.0.21:6802/15917 1 ==== osd_op_reply(230 100000003f6.00000000
>>> > [tmapup 0~0] ondisk = 0) v4 ==== 119+0+0 (1770421071 0 0) 0x2aece00 con
>>> > 0x2aa4200
>>> >   -54> 2014-08-26 17:08:34.362942 7f1c2c704700  1 -- 10.20.0.21:6800/22154
>>> > <== osd.55 10.20.0.23:6800/2407 10 ==== osd_op_reply(263
>>> > 1000000048a.00000000 [getxattr] ack = -2 (No such file or directory)) v4
>>> > ==== 119+0+0 (3908997833 0 0) 0x1e63000 con 0x1e7aaa0
>>> >   -53> 2014-08-26 17:08:34.363001 7f1c2c704700  5 mds.0.log submit_entry
>>> > 427629603~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>>> >   -52> 2014-08-26 17:08:34.363022 7f1c2c704700  1 -- 10.20.0.21:6800/22154
>>> > <== osd.37 10.20.0.22:6898/11994 6 ==== osd_op_reply(226 1.00000000 [tmapput
>>> > 0~7664] ondisk = 0) v4 ==== 109+0+0 (1007110430 0 0) 0x1e64800 con 0x1e7a7e0
>>> >   -51> 2014-08-26 17:08:34.363092 7f1c2c704700  5 mds.0.log _expired
>>> > segment 293601899 2548 events
>>> >   -50> 2014-08-26 17:08:34.363117 7f1c2c704700  1 -- 10.20.0.21:6800/22154
>>> > <== osd.17 10.20.0.21:6941/17572 9 ==== osd_op_reply(264
>>> > 10000000489.00000000 [getxattr] ack = -2 (No such file or directory)) v4
>>> > ==== 119+0+0 (1979034473 0 0) 0x1e62200 con 0x1e7b180
>>> >   -49> 2014-08-26 17:08:34.363177 7f1c2c704700  5 mds.0.log submit_entry
>>> > 427631148~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>>> >   -48> 2014-08-26 17:08:34.363197 7f1c2c704700  1 -- 10.20.0.21:6800/22154
>>> > <== osd.1 10.20.0.21:6872/13227 6 ==== osd_op_reply(265 10000000491.00000000
>>> > [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (1231782695
>>> > 0 0) 0x1e63400 con 0x1e7ac00
>>> >   -47> 2014-08-26 17:08:34.363255 7f1c2c704700  5 mds.0.log submit_entry
>>> > 427632693~1541 : EUpdate purge_stray truncate [metablob 100, 2 dirs]
>>> >   -46> 2014-08-26 17:08:34.363274 7f1c2c704700  1 -- 10.20.0.21:6800/22154
>>> > <== osd.11 10.20.0.21:6884/7018 5 ==== osd_op_reply(266 1000000047d.00000000
>>> > [getxattr] ack = -2 (No such file or directory)) v4 ==== 119+0+0 (2737916920
>>> > 0 0) 0x1e61e00 con 0x1e7bc80
>>> >
>>> > ---------------------------------------------------------------------
>>> > I tried restarting the MDSs, but after a few seconds in the "active"
>>> > state they switch to "laggy or crashed". I have a lot of important data
>>> > on this cluster, and I do not want to use the command:
>>> > ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it
>>> >
>>> > :(
>>> >
>>> > Tien Bui.
>>> >
>>> >
>>> > --
>>> > Bui Minh Tien
>>> >
>>> > _______________________________________________
>>> > ceph-users mailing list
>>> > ceph-users at lists.ceph.com
>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> >
>>
>>
>> --
>> Bui Minh Tien
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users at lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>

--
Bui Minh Tien
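
For reference, a minimal sketch of the debugging step Yan Zheng suggests above (delete the old MDS log, raise debug_mds to 15, restart the daemon, and capture the next crash). The MDS id and log path come from the log_file line in the dump above; the sysvinit service invocation is an assumption for a CentOS 6.5 install of Ceph 0.67 and may differ on other setups:

    # /etc/ceph/ceph.conf on the MDS node -- raise MDS debug logging
    # (equivalent to the suggested "debug_mds = 15")
    [mds]
        debug mds = 15

    # Remove the old log so the new one covers only the next crash,
    # then restart the MDS and wait for it to crash again.
    rm /var/log/ceph/ceph-mds.Ceph01-dc5k3u0104.log
    service ceph restart mds.Ceph01-dc5k3u0104   # sysvinit; id assumed from the log path

After the MDS hits the assert again, the resulting log is what should be sent to the list.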