Re: mds crashing

Adam Tygart <mozes@xxxxxxx> · Wed, 15 Apr 2015 20:07:53 -0500

We are using 3.18.6-gentoo. Based on that, I was hoping that the
kernel bug referred to in the bug report would have been fixed.
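
For reference, the workaround John describes in the quoted message below
(forcibly unmount the kernel clients while the MDS is offline, then bring
the MDS back so it times the stale sessions out during clientreplay) could
be sketched roughly as follows. This is a dry-run planner only: it builds
the list of commands and runs nothing. The client hostnames, the mount
point, the use of ssh, and the explicit 'service ceph stop mds' step are
assumptions for illustration and are not taken from the thread.

```python
import shlex


def plan_sessionmap_workaround(clients, mountpoint, mds_service="mds"):
    """Return the ordered shell commands for the workaround as strings.

    Nothing is executed here; review each command before running it by hand.
    `clients` and `mountpoint` are hypothetical, site-specific values.
    """
    steps = []

    # 1. Make sure the MDS is offline (assumed step; repeat on every MDS host).
    steps.append(f"service ceph stop {mds_service}")

    # 2. Forcibly unmount CephFS on each kernel client while the MDS is down.
    for host in clients:
        steps.append(
            f"ssh {shlex.quote(host)} umount -f {shlex.quote(mountpoint)}"
        )

    # 3. Start the MDS again so it can drop the unmounted clients' sessions
    #    after they time out.
    steps.append(f"service ceph start {mds_service}")

    # 4. Watch the MDS move through replay/reconnect/rejoin/clientreplay/active.
    steps.append("ceph -w")

    return steps


if __name__ == "__main__":
    # Hypothetical client names and mount point, for illustration only.
    for cmd in plan_sessionmap_workaround(["client01", "client02"], "/mnt/cephfs"):
        print(cmd)
```

Keeping this as a printed plan rather than something that executes
commands is deliberate: a forced unmount is disruptive, and the right
order and hosts depend on the cluster.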

--
Adam

On Wed, Apr 15, 2015 at 8:02 PM, Yan, Zheng <ukernel@xxxxxxxxx> wrote:
> On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson <kylehutson@xxxxxxx> wrote:
>> Thank you, John!
>>
>> That was exactly the bug we were hitting. My Google-fu didn't lead me to
>> this one.
>
>
> Here is the bug report: http://tracker.ceph.com/issues/10449. It's a
> kernel client bug that causes the session map size to grow without
> bound. Which version of the Linux kernel are you using?
>
> Regards
> Yan, Zheng
>
>
>>
>> On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>>
>>> On 15/04/2015 20:02, Kyle Hutson wrote:
>>>>
>>>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>>>> pretty well.
>>>>
>>>> Then, about noon today, we had an mds crash. And then the failover mds
>>>> crashed. And this cascaded through all 4 mds servers we have.
>>>>
>>>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>>>> to be OK for a little while. ceph -w goes through 'replay', 'reconnect',
>>>> 'rejoin', 'clientreplay', and 'active', but almost immediately after
>>>> reaching 'active', it crashes again.
>>>>
>>>> I have the mds log at
>>>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>>>
>>>> Some possibly, but not necessarily, useful background info:
>>>> - Yesterday we took our erasure coded pool and increased both pg_num and
>>>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>>>> but those seem to be continuing to clean themselves up.
>>>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>>>> filesystem to this filesystem.
>>>> - Before we noticed the mds crashes, we had just changed the size of our
>>>> metadata pool from 2 to 4.
>>>
>>>
>>> It looks like you're seeing http://tracker.ceph.com/issues/10449, which is
>>> a situation where the SessionMap object becomes too big for the MDS to
>>> save. The cause of it in that case was stuck requests from a misbehaving
>>> client running a slightly older kernel.
>>>
>>> Assuming you're using the kernel client and having a similar problem, you
>>> could try to work around this situation by forcibly unmounting the clients
>>> while the MDS is offline, such that during clientreplay the MDS will remove
>>> them from the SessionMap after timing out, and then next time it tries to
>>> save the map it won't be oversized.  If that works, you could then look into
>>> getting newer kernels on the clients to avoid hitting the issue again -- the
>>> #10449 ticket has some pointers about which kernel changes were relevant.
>>>
>>> Cheers,
>>> John
>>
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>