Re: MDS stuck in a crash loop

On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> About an hour ago my MDSs (primary and follower) started ping-pong
>> crashing with this message. I've spent about 30 minutes looking into
>> it but nothing yet.
>>
>> This is from a 0.94.3 MDS
>>
>
>>      0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
>> In function 'virtual void C_IO_SM_Save::finish(int)' thread
>> 7fd4f52ad700 time 2015-10-11 17:01:23.594089
>> mds/SessionMap.cc: 120: FAILED assert(r == 0)
>
> These "r == 0" asserts pretty much always mean that the MDS did a
> read or write to RADOS (the OSDs) and got an error of some kind back.
> (Or in the case of the OSDs, access to the local filesystem returned
> an error, etc.) I don't think these writes include any safety checks
> which would let the MDS break it, which means that the OSD is
> probably actually returning an error; odd, but not impossible.
>
> Notice that the assert happened in thread 7fd4f52ad700, and look for
> the stuff in that thread. You should be able to find an OSD op reply
> (on the SessionMap object) coming in and reporting an error code.
> -Greg

I only see two error ops in that whole MDS session. Neither one
happened on the same thread as the assert (7f5ab6000700 in this file).
But it looks like the only one on the session map object is the -90
"Message too long" one.

mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v 'ondisk = 0'
 -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 --
10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 ====
osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
ondisk = -90 ((90) Message too long)) v6 ==== 182+0+0 (2955408122 0 0)
0x3a55d340 con 0x3d5a3c0
  -705> 2015-10-11 20:51:11.374132 7f5ab22f4700  1 --
10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 ====
osd_op_reply(48004 300.0000e274 [delete] v0'0 uv1349638 ondisk = -2
((2) No such file or directory)) v6 ==== 179+0+0 (1182549251 0 0)
0x66c5c80 con 0x3d5a7e0
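
For anyone who wants the same filter with the thread ID, object, and
return code pulled out for comparison against the asserting thread,
here is a rough Python sketch. The regex is only inferred from the two
entries above, not from any log format spec, and the function name
failed_replies is made up for illustration. It assumes each log entry
sits on a single line in the file (the entries above are wrapped by my
mail client).

```python
import re

# Layout inferred from the two osd_op_reply entries quoted above; this is
# an assumption, not taken from a Ceph log format specification.
REPLY_RE = re.compile(
    r"(?P<thread>[0-9a-f]{8,16})\s+\d+\s+--\s.*?"
    r"osd_op_reply\((?P<tid>\d+)\s+(?P<obj>\S+)\s+\[(?P<ops>[^\]]*)\]"
    r".*?ondisk\s*=\s*(?P<rc>-?\d+)"
)


def failed_replies(lines):
    """Return (thread, object, ops, rc) for every reply with a nonzero rc."""
    failures = []
    for line in lines:
        match = REPLY_RE.search(line)
        if match is None:
            continue  # not an osd_op_reply entry in the expected layout
        rc = int(match.group("rc"))
        if rc != 0:
            failures.append(
                (match.group("thread"), match.group("obj"),
                 match.group("ops"), rc)
            )
    return failures


# The first entry from the output above, joined back onto one line.
SAMPLE = [
    "-3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 -- "
    "10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 ==== "
    "osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0 "
    "ondisk = -90 ((90) Message too long)) v6 ==== 182+0+0 "
    "(2955408122 0 0) 0x3a55d340 con 0x3d5a3c0",
]

for thread, obj, ops, rc in failed_replies(SAMPLE):
    print(thread, obj, ops, rc)
```

To run it over a real log, pass an open file instead of SAMPLE, e.g.
failed_replies(open("single_crash.log")).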

Any idea what this could be, Greg?

>
>>
>>  ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x8b) [0x94cc1b]
>>  2: /usr/bin/ceph-mds() [0x7c7df1]
>>  3: (MDSIOContextBase::complete(int)+0x81) [0x7c83b1]
>>  4: (Finisher::finisher_thread_entry()+0x1a0) [0x87f490]
>>  5: (()+0x8182) [0x7fd4fd031182]
>>  6: (clone()+0x6d) [0x7fd4fb7a047d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx