On Thu, Apr 16, 2015 at 9:48 AM, Adam Tygart <mozes@xxxxxxx> wrote:
> What is significantly smaller? We have 67 requests in the 16,400,000
> range and 250 in the 18,900,000 range.
>

That explains the crash. Could you help me debug this issue?

Send /sys/kernel/debug/ceph/*/mdsc to me.
Run "echo module ceph +p > /sys/kernel/debug/dynamic_debug/control" on the cephfs mount machine.
Restart the mds and wait until it crashes again.
Run "echo module ceph -p > /sys/kernel/debug/dynamic_debug/control" on the cephfs mount machine.
Send the kernel messages from the cephfs mount machine to me (they should be in /var/log/kern.log or /var/log/messages).

To recover from the crash, you can either force-reset the machine that contains the cephfs mount, or add "mds wipe sessions = 1" to the mds section of ceph.conf.

Regards
Yan, Zheng
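For reference, a minimal shell sketch of the collection steps above, run as root on the machine that holds the cephfs kernel mount; the output file names are placeholders, and the kernel log location varies by distribution:

    # Capture the kernel client's list of in-flight MDS requests.
    cat /sys/kernel/debug/ceph/*/mdsc > mdsc-before-restart.txt

    # Enable dynamic debug output for the ceph kernel module.
    echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control

    # ... restart the MDS on the server side and wait for it to crash again ...

    # Disable dynamic debug output.
    echo 'module ceph -p' > /sys/kernel/debug/dynamic_debug/control

    # Collect the kernel messages for that window (dmesg, or the distro's
    # kernel log file such as /var/log/kern.log or /var/log/messages).
    dmesg > ceph-client-debug.txt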
> Thanks,
>
> Adam
>
> On Wed, Apr 15, 2015 at 8:38 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>> On Thu, Apr 16, 2015 at 9:07 AM, Adam Tygart <mozes@xxxxxxx> wrote:
>>> We are using 3.18.6-gentoo. Based on that, I was hoping that the
>>> kernel bug referred to in the bug report would have been fixed.
>>>
>>
>> The bug was supposed to be fixed, but you hit it again. Could you
>> check whether the kernel client has any hung MDS requests? (Check
>> /sys/kernel/debug/ceph/*/mdsc on the machine that contains the cephfs
>> mount, and look for any request whose ID is significantly smaller than
>> the other requests' IDs.)
>>
>> Regards
>> Yan, Zheng
>>
>>> --
>>> Adam
>>>
>>> On Wed, Apr 15, 2015 at 8:02 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>>>> On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson <kylehutson@xxxxxxx> wrote:
>>>>> Thank you, John!
>>>>>
>>>>> That was exactly the bug we were hitting. My Google-fu didn't lead me to
>>>>> this one.
>>>>
>>>>
>>>> Here is the bug report: http://tracker.ceph.com/issues/10449. It's a
>>>> kernel client bug that causes the session map size to increase without
>>>> bound. Which version of the Linux kernel are you using?
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>
>>>>>
>>>>> On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 15/04/2015 20:02, Kyle Hutson wrote:
>>>>>>>
>>>>>>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>>>>>>> pretty well.
>>>>>>>
>>>>>>> Then, about noon today, we had an mds crash. And then the failover mds
>>>>>>> crashed. And this cascaded through all 4 mds servers we have.
>>>>>>>
>>>>>>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>>>>>>> to be OK for a little while. ceph -w goes through 'replay', 'reconnect',
>>>>>>> 'rejoin', 'clientreplay' and 'active', but nearly immediately after getting
>>>>>>> to 'active', it crashes again.
>>>>>>>
>>>>>>> I have the mds log at
>>>>>>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>>>>>>
>>>>>>> Some possibly, but not necessarily, useful background info:
>>>>>>> - Yesterday we took our erasure-coded pool and increased both pg_num and
>>>>>>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>>>>>>> but those seem to be continuing to clean themselves up.
>>>>>>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>>>>>>> filesystem to this filesystem.
>>>>>>> - Before we realized the mds crashes, we had just changed the size of our
>>>>>>> metadata pool from 2 to 4.
>>>>>>
>>>>>>
>>>>>> It looks like you're seeing http://tracker.ceph.com/issues/10449, which is
>>>>>> a situation where the SessionMap object becomes too big for the MDS to
>>>>>> save. The cause of it in that case was stuck requests from a misbehaving
>>>>>> client running a slightly older kernel.
>>>>>>
>>>>>> Assuming you're using the kernel client and having a similar problem, you
>>>>>> could try to work around this situation by forcibly unmounting the clients
>>>>>> while the MDS is offline, such that during clientreplay the MDS will remove
>>>>>> them from the SessionMap after timing out, and then the next time it tries
>>>>>> to save the map it won't be oversized. If that works, you could then look
>>>>>> into getting newer kernels on the clients to avoid hitting the issue again
>>>>>> -- the #10449 ticket has some pointers about which kernel changes were
>>>>>> relevant.
>>>>>>
>>>>>> Cheers,
>>>>>> John
>>>>>
>>>>>
>>>>>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
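As a rough illustration of the two client-side steps discussed in this thread -- spotting a stuck request in mdsc, and John Spray's suggested forced-unmount workaround -- here is a sketch. The mount point is a placeholder, and it assumes the request ID is the first field of each mdsc line (the numbers Adam quotes above):

    # In-flight MDS requests, lowest request ID first; a stuck request from
    # this bug shows up with an ID far below the others (~16.4M vs ~18.9M here).
    sort -n /sys/kernel/debug/ceph/*/mdsc | head

    # While the MDS is offline, forcibly unmount the misbehaving client so the
    # MDS drops its session during clientreplay and the SessionMap shrinks
    # enough to be written. /mnt/cephfs is a placeholder mount point.
    umount -f /mnt/cephfs
    # umount -l /mnt/cephfs   # lazy unmount as a fallback if -f blocks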