ceph 0.78 mon and mds crashing (bus error)

greg at inktank.com (Gregory Farnum) · Tue, 1 Apr 2014 09:03:17 -0700



On Tue, Apr 1, 2014 at 7:12 AM, Yan, Zheng <ukernel at gmail.com> wrote:
> On Tue, Apr 1, 2014 at 10:02 PM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>> After some more searching, I've found that the source of the problem is with
>> the mds and not the mon. The mds crashes, generates a core dump that eats
>> the local space, and in turn the monitor (because of leveldb) crashes.
>>
>> The error in the mds log of one host:
>>
>> 2014-04-01 15:46:34.414615 7f870e319700  0 -- 10.141.8.180:6836/13152 >>
>> 10.141.8.180:6789/0 pipe(0x517371180 sd=54 :42439 s=4 pgs=0 cs=0 l=1
>> c=0x147ac780).connect got RESETSESSION but no longer connecting
>> 2014-04-01 15:46:34.438792 7f871194f700  0 -- 10.141.8.180:6836/13152 >>
>> 10.141.8.180:6789/0 pipe(0x1b099f580 sd=8 :43150 s=4 pgs=0 cs=0 l=1
>> c=0x1fd44360).connect got RESETSESSION but no longer connecting
>> 2014-04-01 15:46:34.439028 7f870e319700  0 -- 10.141.8.180:6836/13152 >>
>> 10.141.8.182:6789/0 pipe(0x13aa64880 sd=54 :37085 s=4 pgs=0 cs=0 l=1
>> c=0x1fd43de0).connect got RESETSESSION but no longer connecting
>> 2014-04-01 15:46:34.468257 7f871b7ae700 -1 mds/CDir.cc: In function 'void
>> CDir::_omap_fetched(ceph::bufferlist&, std::map<std::basic_string<char,
>> std::char_traits<char>, std::allocator<char> >, ceph::buffer::list,
>> std::less<std::basic_string<char, std::char_traits<char>,
>> std::allocator<char> > >, std::allocator<std::pair<const
>> std::basic_string<char, std::char_traits<char>, std::allocator<char> >,
>> ceph::buffer::list> > >&, const std::string&, int)' thread 7f871b7ae700 time
>> 2014-04-01 15:46:34.448320
>> mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61)
>>
>
> Could you use gdb to check the value of the variable 'r'?

If you look at the crash dump log you can see the return value in the
osd_op_reply message:
-1> 2014-04-01 15:46:34.440860 7f871b7ae700  1 --
10.141.8.180:6836/13152 <== osd.3 10.141.8.180:6827/4366 33077 ====
osd_op_reply(4179177 100001f2ef1.00000000 [omap-get-header
0~0,omap-get-vals 0~16] v0'0 uv0 ack = -108 (Cannot send after
transport endpoint shutdown)) v6 ==== 229+0+0 (958358678 0 0)
0x2cff7aa80 con 0x37ea3c0
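
If you want to pull that return code out of a log programmatically, here is a minimal sketch. It assumes the reply looks like the entry above (possibly wrapped across lines) and that the result follows "ack = ". The regex and the helper name reply_code are illustrative only, not part of any Ceph tool.

```python
import re

# Hypothetical pattern: result code after "ack = ", optionally followed by
# a parenthesised error string, inside an osd_op_reply(...) message.
REPLY_RE = re.compile(r"osd_op_reply\(.*?ack = (-?\d+)(?: \(([^)]*)\))?")

def reply_code(entry):
    """Return (code, message) from an osd_op_reply log entry, or None."""
    # Collapse line wraps and repeated spaces before matching.
    text = " ".join(entry.split())
    m = REPLY_RE.search(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

entry = """osd_op_reply(4179177 100001f2ef1.00000000 [omap-get-header
0~0,omap-get-vals 0~16] v0'0 uv0 ack = -108 (Cannot send after
transport endpoint shutdown)) v6 ==== 229+0+0 (958358678 0 0)"""

print(reply_code(entry))
```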

The return value is -108, which is ESHUTDOWN, but we also use it (via
the 108 constant, I think because ESHUTDOWN varies across platforms)
as EBLACKLISTED. That is not one of the values the assert accepts (0,
-2, -61), hence the crash. So it looks like this is itself a symptom
of another problem that is causing the MDS to get timed out on the
monitor. If a core dump is "eating the local space", maybe the MDS is
stuck in an infinite allocation loop of some kind? How big are your
disks, Kenneth? Do you have any information on how much CPU/memory the
MDS was using before this?
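
To make the assert failure concrete, here is a rough sketch of the check in Python. The errno numbers are the Linux values, hard-coded on purpose since they differ across platforms. The helper name fetch_result_ok is made up for illustration, and the EBLACKLISTED reading of 108 is as described above, not something this snippet verifies.

```python
# Linux errno values, hard-coded because they are platform-dependent.
ENOENT = 2
ENODATA = 61
ESHUTDOWN = 108  # reportedly also used as EBLACKLISTED (see above)

NAMES = {0: "success", -ENOENT: "ENOENT", -ENODATA: "ENODATA",
         -ESHUTDOWN: "ESHUTDOWN/EBLACKLISTED"}

# Values accepted by: assert(r == 0 || r == -2 || r == -61)
ACCEPTED = {0, -ENOENT, -ENODATA}

def fetch_result_ok(r):
    """True if r would pass the assert at mds/CDir.cc:1474."""
    return r in ACCEPTED

for r in (0, -2, -61, -108):
    print(r, NAMES.get(r, "unknown"), fetch_result_ok(r))
```

Only the last value fails the check, which matches the crash in the log.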
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com