On Thu, May 23, 2013 at 2:43 PM, Giuseppe 'Gippa' Paterno' <gpaterno@xxxxxxxxxxxx> wrote:
> Hi!
>
> I've got a cluster of two nodes on Ubuntu 12.04 with cuttlefish from the
> ceph.com repo.
> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
>
> The MDS process is dying after a while with a stack trace, but I can't
> understand why.
> I reproduced the same problem on Debian 7 with the same repository.
>
>     -3> 2013-05-23 23:00:42.957679 7fa39e28e700  1 -- 10.123.200.189:6800/28919 <== osd.0 10.123.200.188:6802/27665 1 ==== osd_op_reply(5 200.00000000 [read 0~0] ack = -2 (No such file or directory)) v4 ==== 111+0+0 (2261481792 0 0) 0x29afe00 con 0x29c4b00
>     -2> 2013-05-23 23:00:42.957780 7fa39e28e700  0 mds.0.journaler(ro) error getting journal off disk
>     -1> 2013-05-23 23:00:42.960974 7fa39e28e700  1 -- 10.123.200.189:6800/28919 <== osd.0 10.123.200.188:6802/27665 2 ==== osd_op_reply(1 mds0_inotable [read 0~0] ack = -2 (No such file or directory)) v4 ==== 112+0+0 (1612134461 0 0) 0x2a1c200 con 0x29c4b00
>      0> 2013-05-23 23:00:42.963326 7fa39e28e700 -1 mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7fa39e28e700 time 2013-05-23 23:00:42.961076
> mds/MDSTable.cc: 150: FAILED assert(0)
>
> ceph version 0.61.2 (fea782543a844bb277ae94d3391788b76c5bee60)
> 1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x3bb) [0x6dd2db]
> 2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1b) [0x7275bb]
> 3: (MDS::handle_core_message(Message*)+0xae7) [0x513c57]
> 4: (MDS::_dispatch(Message*)+0x33) [0x513d53]
> 5: (MDS::ms_dispatch(Message*)+0xab) [0x515b3b]
> 6: (DispatchQueue::entry()+0x393) [0x847ca3]
> 7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7caeed]
> 8: (()+0x6b50) [0x7fa3a3376b50]
> 9: (clone()+0x6d) [0x7fa3a1d24a7d]
>
> Full logs here:
> http://pastebin.com/C81g5jFd
>
> I can't understand why, and I'd really appreciate a hint.

This backtrace indicates that the MDS went to load a RADOS object that doesn't exist. We've seen this pop up occasionally, but sadly we haven't been able to diagnose the cause (for developers following along at home: I'm wondering if it's related to http://tracker.ceph.com/issues/4894, but that's pure speculation; I haven't checked the write orders at all).

Do I correctly assume that you don't have any CephFS data in the cluster yet? If so, I'd just delete your current filesystem and metadata pool, then recreate them. It should all be in the docs, and there's a rough sketch of the steps below my signature. :)

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
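
A sketch of what "delete and recreate" would look like, assuming the default "data" and "metadata" pool names and cuttlefish-era command syntax (treat it as an outline only: double-check the syntax against the docs for your version, and only do this if you're sure there's no CephFS data you care about):

    # Confirm the objects the MDS is complaining about really are missing
    rados -p metadata ls
    rados -p metadata stat mds0_inotable

    # With the MDS stopped, drop and recreate the metadata pool
    # (newer releases make you repeat the pool name and pass a confirmation flag)
    ceph osd pool delete metadata
    ceph osd pool create metadata 64    # 64 PGs is just an example value

    # Point the MDS map at the fresh pool; the arguments are pool IDs,
    # which you can look up with "ceph osd dump | grep pool"
    ceph mds newfs <metadata pool id> <data pool id> --yes-i-really-mean-it

Then restart the MDS and it should initialize a brand-new, empty filesystem.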