ceph 0.78 mon and mds crashing (bus error)

hi,

>> 1) How big and what shape the filesystem is. Do you have some
>> extremely large directory that the MDS keeps trying to load and then
>> dump?
> any way to extract this from the mds without having to start it? as it
> was an rsync operation, i can try to locate possible candidates on the
> source filesystem, but what would be considered "large"?
total number of files is 13M, spread over 800k directories, but it's 
unclear how far the sync had gotten at the time of the failure. i've not 
found a good way yet to scan for directories with lots of files and/or subdirs.
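fwiw, the rough scan i have in mind on the rsync source looks something 
like this (just a sketch; the source path is a placeholder, and it only 
counts direct entries per directory, so it will be slow over 13M files 
but should flag the big candidates):

# rank directories on the rsync source by number of direct entries
find /path/to/rsync/source -xdev -type d -print0 |
  while IFS= read -r -d '' d; do
    printf '%s\t%s\n' "$(find "$d" -mindepth 1 -maxdepth 1 | wc -l)" "$d"
  done | sort -rn | head -20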

>
>> 2) Use tcmalloc's heap analyzer to see where all the memory is being
>> allocated.
> we'll give that a try
i ran ceph-mds with HEAPCHECK=normal (via the init script), but how can 
we "stop" the mds without killing it? the heap checker only seems to dump 
at the end of a run; maybe there's a way to get an intermediate dump like 
with valgrind, but the documentation is not very helpful.
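we'll also look at tcmalloc's heap profiler instead of the leak checker, 
since that is supposed to be able to dump on demand while the mds keeps 
running. a sketch of what we plan to run, assuming this build is linked 
against tcmalloc and exposes the ceph heap commands ("a" is a placeholder 
mds id, paths are examples):

ceph tell mds.a heap start_profiler    # start the profiler on the running mds
ceph tell mds.a heap dump              # write an intermediate .heap snapshot
ceph tell mds.a heap stats             # print tcmalloc heap statistics
# snapshots should end up next to the mds log; analyse them with gperftools' pprof
pprof --text /usr/bin/ceph-mds /var/log/ceph/mds.a.profile.0001.heap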

stijn

>
>> 3) Look through the logs for when the beacon fails (the first of
>> "mds.0.16 is_laggy 600.641332 > 15 since last acked beacon") and see
>> if there's anything tell-tale going on at the time.
>
> anything in particular we should be looking for?
>
> the log goes as follows:
> mds starts around 11:43
> ...
>
>> 2014-04-01 11:44:23.658583 7ffec89c6700  1 mds.0.server
>> reconnect_clients -- 1 sessions
>> 2014-04-01 11:44:41.212488 7ffec89c6700  0 log [DBG] : reconnect by
>> client.4585 10.141.8.199:0/3551 after 17.553854
>> 2014-04-01 11:44:45.692237 7ffec89c6700  1 mds.0.10 reconnect_done
>> 2014-04-01 11:44:45.996384 7ffec89c6700  1 mds.0.10 handle_mds_map i
>> am now mds.0.10
>> 2014-04-01 11:44:45.996388 7ffec89c6700  1 mds.0.10 handle_mds_map
>> state change up:reconnect --> up:rejoin
>> 2014-04-01 11:44:45.996390 7ffec89c6700  1 mds.0.10 rejoin_start
>> 2014-04-01 11:49:53.158471 7ffec89c6700  1 mds.0.10 rejoin_joint_start
>
> then lots (4667 lines) of
>> 2014-04-01 11:50:10.237035 7ffebc844700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.180:6789/0 pipe(0x38a7da00 sd=104 :41115 s=4 pgs=0 cs=0
>> l=1 c=0x6513e8840).connect got RESETSESSION but no longer connecting
>
> with one intermediate
>> 2014-04-01 11:51:50.181354 7ffebcf4b700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.180:6789/0 pipe(0x10e282580 sd=103 :0 s=1 pgs=0 cs=0 l=1
>> c=0xc77d5ee0).fault
>
>
> then a sudden change
>> 2014-04-01 11:57:30.440554 7ffebcd49700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.182:6789/0 pipe(0xa1534100 sd=104 :48176 s=4 pgs=0 cs=0
>> l=1 c=0xd99b11e0).connect got RESETSESSION but no longer connecting
>> 2014-04-01 11:57:30.722607 7ffebec68700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0
>> l=1 c=0xc48a3f40).connect got BADAUTHORIZER
>> 2014-04-01 11:57:30.722669 7ffebec68700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=104 :48235 s=1 pgs=0 cs=0
>> l=1 c=0xc48a3f40).connect got BADAUTHORIZER
>> 2014-04-01 11:57:30.722885 7ffebec68700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1
>> c=0xc48a3f40).connect got BADAUTHORIZER
>> 2014-04-01 11:57:30.722945 7ffebec68700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0x1ce98a00 sd=57 :48237 s=1 pgs=0 cs=0 l=1
>> c=0xc48a3f40).connect got BADAUTHORIZER
>
> followed by lots of
>> 2014-04-01 11:57:30.738562 7ffebec68700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0xead9fa80 sd=57 :0 s=1 pgs=0 cs=0 l=1
>> c=0x10e5d280).fault
>
> with sporadic
>> 2014-04-01 11:57:32.431219 7ffebeb67700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.182:6789/0 pipe(0xef85cd80 sd=103 :41218 s=4 pgs=0 cs=0
>> l=1 c=0x130590dc0).connect got RESETSESSION but no longer connecting
>
>
> until the dump
>> 2014-04-01 11:59:27.612850 7ffebea66700  0 -- 10.141.8.180:6837/55117
>> >> 10.141.8.181:6789/0 pipe(0xe3036400 sd=103 :0 s=1 pgs=0 cs=0 l=1
>> c=0xa7be300).fault
>> 2014-04-01 11:59:27.639009 7ffec89c6700 -1 mds/CDir.cc: In function
>> 'void CDir::_omap_fetched(ceph::bufferlist&,
>> std::map<std::basic_string<char, std::char_traits<char>, std::allocator\
>> <char> >, ceph::buffer::list, std::less<std::basic_string<char,
>> std::char_traits<char>, std::allocator<char> > >,
>> std::allocator<std::pair<const std::basic_string<char, std::char_trait\
>> s<char>, std::allocator<char> >, ceph::buffer::list> > >&, const
>> std::string&, int)' thread 7ffec89c6700 time 2014-04-01 11:59:27.620684
>> mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61)
>>
>>  ceph version 0.78 (f6c746c314d7b87b8419b6e584c94bfe4511dbd4)
>>  1: (CDir::_omap_fetched(ceph::buffer::list&, std::map<std::string,
>> ceph::buffer::list, std::less<std::string>,
>> std::allocator<std::pair<std::string const, ceph::buffer::list> > >&, st\
>> d::string const&, int)+0x4d71) [0x77c3c1]
>>  2: (Context::complete(int)+0x9) [0x56bb79]
>>  3: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0x11a6) [0x806dd6]
>>  4: (MDS::handle_core_message(Message*)+0x9c7) [0x5901d7]
>>  5: (MDS::_dispatch(Message*)+0x2f) [0x59028f]
>>  6: (MDS::ms_dispatch(Message*)+0x1ab) [0x591d4b]
>>  7: (DispatchQueue::entry()+0x582) [0x902072]
>>  8: (DispatchQueue::DispatchThread::entry()+0xd) [0x85ef4d]
>>  9: /lib64/libpthread.so.0() [0x34c36079d1]
>>  10: (clone()+0x6d) [0x34c32e8b6d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> --- begin dump of recent events ---
>> -10000> 2014-04-01 11:59:27.137779 7ffec89c6700  5 mds.0.10 initiating
>> monitor reconnect; maybe we're not the slow one
>>  -9999> 2014-04-01 11:59:27.137787 7ffec89c6700 10 monclient(hunting):
>> _reopen_session rank -1 name
>>  -9998> 2014-04-01 11:59:27.137790 7ffec89c6700  1 --
>> 10.141.8.180:6837/55117 mark_down 0x1326be460 -- 0x2cae5f00
>>  -9997> 2014-04-01 11:59:27.137807 7ffebf672700  2 --
>> 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x2cae5f00 sd=57
>> :0 s=4 pgs=0 cs=0 l=1 c=0x1326be460).connect error 10.141.8.1\
>> 81:6789/0, 104: Connection reset by peer
>>  -9996> 2014-04-01 11:59:27.137819 7ffebf672700  3 --
>> 10.141.8.180:6837/55117 >> 10.141.8.181:6789/0 pipe(0x2cae5f00 sd=57
>> :0 s=4 pgs=0 cs=0 l=1 c=0x1326be460).connect fault, but state\
>>  = closed != connecting, stopping
>>  -9995> 2014-04-01 11:59:27.137824 7ffec89c6700 10 monclient(hunting):
>> picked mon.ceph003 con 0x3af54ba0 addr 10.141.8.182:6789/0
>>  -9994> 2014-04-01 11:59:27.137832 7ffec89c6700 10 monclient(hunting):
>> _send_mon_message to mon.ceph003 at 10.141.8.182:6789/0
>
>
>
>
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Apr 2, 2014 at 3:39 AM, Kenneth Waegeman
>> <Kenneth.Waegeman at ugent.be> wrote:
>>>
>>> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>>>     Date: Tue, 1 Apr 2014 09:03:17 -0700
>>>     From: Gregory Farnum <greg at inktank.com>
>>>
>>> Subject: Re: ceph 0.78 mon and mds crashing (bus error)
>>>       To: "Yan, Zheng" <ukernel at gmail.com>
>>>       Cc: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>, ceph-users
>>> <ceph-users at lists.ceph.com>
>>>
>>>
>>>
>>>> On Tue, Apr 1, 2014 at 7:12 AM, Yan, Zheng <ukernel at gmail.com> wrote:
>>>>>
>>>>> On Tue, Apr 1, 2014 at 10:02 PM, Kenneth Waegeman
>>>>> <Kenneth.Waegeman at ugent.be> wrote:
>>>>>>
>>>>>> After some more searching, I've found that the source of the
>>>>>> problem is
>>>>>> with
>>>>>> the mds and not the mon.. The mds crashes, generates a core dump that
>>>>>> eats
>>>>>> the local space, and in turn the monitor (because of leveldb)
>>>>>> crashes.
>>>>>>
>>>>>> The error in the mds log of one host:
>>>>>>
>>>>>> 2014-04-01 15:46:34.414615 7f870e319700  0 --
>>>>>> 10.141.8.180:6836/13152 >>
>>>>>> 10.141.8.180:6789/0 pipe(0x517371180 sd=54 :42439 s=4 pgs=0 cs=0 l=1
>>>>>> c=0x147ac780).connect got RESETSESSION but no longer connecting
>>>>>> 2014-04-01 15:46:34.438792 7f871194f700  0 --
>>>>>> 10.141.8.180:6836/13152 >>
>>>>>> 10.141.8.180:6789/0 pipe(0x1b099f580 sd=8 :43150 s=4 pgs=0 cs=0 l=1
>>>>>> c=0x1fd44360).connect got RESETSESSION but no longer connecting
>>>>>> 2014-04-01 15:46:34.439028 7f870e319700  0 --
>>>>>> 10.141.8.180:6836/13152 >>
>>>>>> 10.141.8.182:6789/0 pipe(0x13aa64880 sd=54 :37085 s=4 pgs=0 cs=0 l=1
>>>>>> c=0x1fd43de0).connect got RESETSESSION but no longer connecting
>>>>>> 2014-04-01 15:46:34.468257 7f871b7ae700 -1 mds/CDir.cc: In function
>>>>>> 'void
>>>>>> CDir::_omap_fetched(ceph::bufferlist&,
>>>>>> std::map<std::basic_string<char,
>>>>>> std::char_traits<char>, std::allocator<char> >, ceph::buffer::list,
>>>>>> std::less<std::basic_string<char, std::char_traits<char>,
>>>>>> std::allocator<char> > >, std::allocator<std::pair<const
>>>>>> std::basic_string<char, std::char_traits<char>,
>>>>>> std::allocator<char> >,
>>>>>> ceph::buffer::list> > >&, const std::string&, int)' thread
>>>>>> 7f871b7ae700
>>>>>> time
>>>>>> 2014-04-01 15:46:34.448320
>>>>>> mds/CDir.cc: 1474: FAILED assert(r == 0 || r == -2 || r == -61)
>>>>>>
>>>>>
>>>>> could you use gdb to check what is value of variable 'r' .
>>>>
>>>>
>>>> If you look at the crash dump log you can see the return value in the
>>>> osd_op_reply message:
>>>> -1> 2014-04-01 15:46:34.440860 7f871b7ae700  1 --
>>>> 10.141.8.180:6836/13152 <== osd.3 10.141.8.180:6827/4366 33077 ====
>>>> osd_op_reply(4179177 100001f2ef1.00000000 [omap-get-header
>>>> 0~0,omap-get-vals 0~16] v0'0 uv0 ack = -108 (Cannot send after
>>>> transport endpoint shutdown)) v6 ==== 229+0+0 (958358678 0 0)
>>>> 0x2cff7aa80 con 0x37ea3c0
>>>>
>>>> -108, which is ESHUTDOWN, but we also use it (via the 108 constant, I
>>>> think because ESHUTDOWN varies across platforms) as EBLACKLISTED.
>>>> So it looks like this is itself actually a symptom of another problem
>>>> that is causing the MDS to get timed out on the monitor. If a core
>>>> dump is "eating the local space", maybe the MDS is stuck in an
>>>> infinite allocation loop of some kind? How big are your disks,
>>>> Kenneth? Do you have any information on how much CPU/memory the MDS
>>>> was using before this?
>>>
>>>
>>> I monitored the mds process after restart:
>>>
>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>> 19215 root      20   0 6070m 5.7g 5236 S 778.6 18.1   1:27.54 ceph-mds
>>> 19215 root      20   0 7926m 7.5g 5236 S 179.2 23.8   2:44.39 ceph-mds
>>> 19215 root      20   0 12.4g  12g 5236 S 157.2 38.8   3:43.47 ceph-mds
>>> 19215 root      20   0 16.6g  16g 5236 S 144.4 52.0   4:15.01 ceph-mds
>>> 19215 root      20   0 19.9g  19g 5236 S 137.2 62.5   4:35.83 ceph-mds
>>> 19215 root      20   0 24.5g  24g 5224 S 136.5 77.0   5:04.66 ceph-mds
>>> 19215 root      20   0 25.8g  25g 2944 S 33.7 81.2   5:13.74 ceph-mds
>>> 19215 root      20   0 26.0g  25g 2916 S 24.6 81.7   5:19.07 ceph-mds
>>> 19215 root      20   0 26.1g  25g 2916 S 13.0 82.1   5:22.16 ceph-mds
>>> 19215 root      20   0 27.7g  26g 1856 S 100.0 85.8   5:36.46 ceph-mds
>>>
>>> Then it crashes. I changed the core dump location out of the root fs,
>>> the
>>> core dump is indeed about 26G
>>>
>>> My disks:
>>>
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/sda2       9.9G  2.9G  6.5G  31% /
>>> tmpfs            16G     0   16G   0% /dev/shm
>>> /dev/sda1       248M   53M  183M  23% /boot
>>> /dev/sda4       172G   61G  112G  35% /var/lib/ceph/log/sda4
>>> /dev/sdb        187G   61G  127G  33% /var/lib/ceph/log/sdb
>>> /dev/sdc        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdc
>>> /dev/sdd        3.7T  1.5T  2.2T  41% /var/lib/ceph/osd/sdd
>>> /dev/sde        3.7T  1.4T  2.4T  37% /var/lib/ceph/osd/sde
>>> /dev/sdf        3.7T  1.5T  2.3T  39% /var/lib/ceph/osd/sdf
>>> /dev/sdg        3.7T  2.1T  1.7T  56% /var/lib/ceph/osd/sdg
>>> /dev/sdh        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdh
>>> /dev/sdi        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdi
>>> /dev/sdj        3.7T  1.7T  2.0T  47% /var/lib/ceph/osd/sdj
>>> /dev/sdk        3.7T  2.1T  1.6T  58% /var/lib/ceph/osd/sdk
>>> /dev/sdl        3.7T  1.7T  2.0T  46% /var/lib/ceph/osd/sdl
>>> /dev/sdm        3.7T  1.5T  2.2T  41% /var/lib/ceph/osd/sdm
>>> /dev/sdn        3.7T  1.4T  2.3T  38% /var/lib/ceph/osd/sdn
>>>
>>>
>>>
>>>
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>>
>>> ----- End message from Gregory Farnum <greg at inktank.com> -----
>>>
>>>
>>>
>>> --
>>>
>>> Kind regards,
>>> Kenneth Waegeman
>>>

