Re: MDS stuck in a crash loop

On Wed, Oct 14, 2015 at 12:46 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> On Sun, Oct 11, 2015 at 7:36 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>> On Sun, Oct 11, 2015 at 6:44 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>> On Sun, Oct 11, 2015 at 6:01 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>> On Sun, Oct 11, 2015 at 5:33 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>>> On Sun, Oct 11, 2015 at 5:24 PM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>>>> On Sun, Oct 11, 2015 at 1:16 PM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>>>>> On Sun, Oct 11, 2015 at 10:09 AM, Milosz Tanski <milosz@xxxxxxxxx> wrote:
>>>>>>>> About an hour ago my MDSs (primary and follower) started ping-pong
>>>>>>>> crashing with this message. I've spent about 30 minutes looking into
>>>>>>>> it but haven't found anything yet.
>>>>>>>>
>>>>>>>> This is from a 0.94.3 MDS
>>>>>>>>
>>>>>>>
>>>>>>>>      0> 2015-10-11 17:01:23.596008 7fd4f52ad700 -1 mds/SessionMap.cc:
>>>>>>>> In function 'virtual void C_IO_SM_Save::finish(int)' thread
>>>>>>>> 7fd4f52ad700 time 2015-10-11 17:01:23.594089
>>>>>>>> mds/SessionMap.cc: 120: FAILED assert(r == 0)
>>>>>>>
>>>>>>> These "r == 0" asserts pretty much always mean that the MDS did did a
>>>>>>> read or write to RADOS (the OSDs) and got an error of some kind back.
>>>>>>> (Or in the case of the OSDs, access to the local filesystem returned
>>>>>>> an error, etc.) I don't think these writes include any safety checks
>>>>>>> which would let the MDS break it which means that probably the OSD is
>>>>>>> actually returning an error — odd, but not impossible.
>>>>>>>
>>>>>>> Notice that the assert happened in thread 7fd4f52ad700, and look for
>>>>>>> the stuff in that thread. You should be able to find an OSD op reply
>>>>>>> (on the SessionMap object) coming in and reporting an error code.
>>>>>>> -Greg
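(For anyone digging through a similar log, a rough way to do that, with
the thread id taken from the assert line and the log path adjusted to
your own setup:

  grep '7fd4f52ad700' /var/log/ceph/ceph-mds.*.log | grep 'osd_op_reply'
)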
>>>>>>
>>>>>> I see only two error ops in that whole MDS session. Neither one
>>>>>> happened on the same thread (7f5ab6000700 in this file), but it looks
>>>>>> like the only sessionmap-related one is the -90 "Message too long"
>>>>>> error.
>>>>>>
>>>>>> mtanski@tiny:~$ cat single_crash.log | grep 'osd_op_reply' | grep -v
>>>>>> 'ondisk = 0'
>>>>>>  -3946> 2015-10-11 20:51:11.013965 7f5ab20f2700  1 --
>>>>>> 10.0.5.31:6802/27121 <== osd.25 10.0.5.57:6804/32341 6163 ====
>>>>>> osd_op_reply(46349 mds0_sessionmap [writefull 0~95168363] v0'0 uv0
>>>>>> ondisk = -90 ((90) Message too long)) v6 ==== 182+0+0 (2955408122 0 0)
>>>>>> 0x3a55d340 con 0x3d5a3c0
>>>>>>   -705> 2015-10-11 20:51:11.374132 7f5ab22f4700  1 --
>>>>>> 10.0.5.31:6802/27121 <== osd.28 10.0.5.50:6801/1787 5297 ====
>>>>>> osd_op_reply(48004 300.0000e274 [delete] v0'0 uv1349638 ondisk = -2
>>>>>> ((2) No such file or directory)) v6 ==== 179+0+0 (1182549251 0 0)
>>>>>> 0x66c5c80 con 0x3d5a7e0
>>>>>>
>>>>>> Any idea what this could be Greg?
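(Side note: that writefull is 95168363 bytes, which is just over the
default 90 MB osd_max_write_size, so the -90 EMSGSIZE is presumably the
OSD refusing an oversized write. The limit can be checked via the OSD's
admin socket, e.g.:

  ceph daemon osd.25 config get osd_max_write_size
)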
>>>>>
>>>>> To follow up on this, I found this ticket from 9 months ago:
>>>>> http://tracker.ceph.com/issues/10449 where Yan says:
>>>>>
>>>>> "it's a kernel bug. hang request prevents mds from trimming
>>>>> completed_requests in sessionmap. there is nothing to do with mds.
>>>>> (maybe we should add some code to MDS to show warning when this bug
>>>>> happens)"
>>>>>
>>>>> While I was debugging this, I saw an OSD operation (not a cephfs
>>>>> client one) stuck for a long time, along with the MDS error:
>>>>>
>>>>> HEALTH_WARN 1 requests are blocked > 32 sec; 1 osds have slow
>>>>> requests; mds cluster is degraded; mds0: Behind on trimming (709/30)
>>>>> 1 ops are blocked > 16777.2 sec
>>>>> 1 ops are blocked > 16777.2 sec on osd.28
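(The blocked op itself can usually be inspected on the OSD through its
admin socket, e.g.:

  ceph daemon osd.28 dump_ops_in_flight
  ceph daemon osd.28 dump_historic_ops
)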
>>>>>
>>>>> I did eventually bounce the OSD in question and it hasn't become stuck
>>>>> since, but the MDS is still eating it every time with the "Message too
>>>>> long" error on the session map.
>>>>>
>>>>> I'm not quite sure where to go from here.
>>>>
>>>> This was the first time I had a chance to use the new recovery tools. I
>>>> was able to replay the journal, reset it, and then reset the sessionmap.
>>>> The MDS came back to life and so far everything looks good. Yay.
>>>>
>>>> The set of steps that triggers this bug/issue is pretty interesting.
>>>
>>> Spoke too soon; a missing dir is now causing the MDS to restart itself.
>>>
>>>     -6> 2015-10-11 22:40:47.300169 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300168, event: finishing request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>>     -5> 2015-10-11 22:40:47.300208 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300208, event: cleaned up request,
>>> op: client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>>     -4> 2015-10-11 22:40:47.300231 7f580c7b9700  5 -- op tracker --
>>> seq: 4, time: 2015-10-11 22:40:47.300231, event: done, op:
>>> client_request(client.3597476:21480382 rmdir #100015e0be2/58
>>> 2015-10-11 21:34:49.224905 RETRY=36)
>>>     -3> 2015-10-11 22:40:47.300284 7f580e0bd700  0
>>> mds.0.cache.dir(100048df076) _fetched missing object for [dir
>>> 100048df076 /petabucket/beta/6d/f6/ [2,head] auth v=0 cv=0/0 ap=1+0+0
>>> state=1073741952 f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x6debcf8]
>>>     -2> 2015-10-11 22:40:47.300313 7f580e0bd700 -1
>>> log_channel(cluster) log [ERR] : dir 100048df076 object missing on
>>> disk; some files may be lost
>>>     -1> 2015-10-11 22:40:47.300549 7f580c7b9700  5 -- op tracker --
>>> seq: 6, time: 2015-10-11 22:40:47.300549, event: acquired locks, op:
>>> client_request(client.3710941:1805 mkdir
>>> #100048df076/1dd4c66b916cf55f19beafc0574ac2dd923bc7efcc5b7757198f323e04d0.dl
>>> 2015-10-11 22:36:32.489705 RETRY=33)
>>>      0> 2015-10-11 22:40:47.303770 7f580c7b9700 -1 mds/CDir.cc: In
>>> function 'fnode_t* CDir::project_fnode()' thread 7f580c7b9700 time
>>> 2015-10-11 22:40:47.300581
>>> mds/CDir.cc: 1267: FAILED assert(get_version() != 0)
>>>
>>>  ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x8b) [0x94cc1b]
>>>  2: (CDir::project_fnode()+0x2da) [0x75da1a]
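(The dir object the MDS complains about can be checked directly in the
metadata pool; the object name is the dir inode in hex plus the fragment
id, so for the default frag something like the following, with the pool
name adjusted to whatever your metadata pool is actually called:

  rados -p metadata stat 100048df076.00000000
)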
>>
>> Worked around this as well. I stopped the client that was trying to
>> access this dir (or delete it, not sure) and moved it somewhere else
>> (because I couldn't delete it).
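(Roughly, from a client mount, with the mount point and quarantine path
made up for illustration:

  mkdir -p /mnt/cephfs/quarantine
  mv /mnt/cephfs/petabucket/beta/6d/f6 /mnt/cephfs/quarantine/
)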
>>
>> Here is the series of steps that happened today (not sure if all are
>> related), in case the postmortem is interesting to anybody following
>> along at home.
>>
>> 1. One (of 5) monitors somehow gets OOM killed after eating up 10gigs of ram.
>> 2. An OSD blocks on an IO request
>> 3. This is when the MDS starts growing its SessionMap until eventually
>> it gets too big. This is related to bug:
>> http://tracker.ceph.com/issues/10449 (not the client bug, but the
>> too-large SessionMap)
>> 4. The MDS suicides and then the two MDS machines start ping-ponging in
>> recovery, always failing to save the SessionMap.
>> 5. I stop the MDS, recover the journal and reset the session map.
>
> Exactly what did you do in this step? I have a suspicion you didn't
> recover and flush out the journal all the way, and that's what caused
> your directory to go bad.
> -Greg

I followed the directions from the Hammer recovery page for CephFS: the
backup steps, recovery, truncation, and the session clear, in that order.
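If memory serves, that boils down to something along these lines (rank 0
assumed; double check against the page before running any of it):

  cephfs-journal-tool journal export backup.bin
  cephfs-journal-tool event recover_dentries summary
  cephfs-journal-tool journal reset
  cephfs-table-tool all reset session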

>
>> 6. The MDS comes back to life after that.
>> 7. Ceph loses one directory, and as soon as the client tries to access
>> it the MDS also suicides. Standing bug:
>> http://tracker.ceph.com/issues/9880
>> 8. Lame workaround (moving/renaming the bad dir) so the client doesn't
>> access it.
>>
>> --
>> Milosz Tanski
>> CTO
>> 16 East 34th Street, 15th floor
>> New York, NY 10016
>>
>> p: 646-253-9055
>> e: milosz@xxxxxxxxx



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx


