Re: mds crashing

"Yan, Zheng" <ukernel@xxxxxxxxx> · Thu, 16 Apr 2015 09:02:54 +0800

On Thu, Apr 16, 2015 at 5:29 AM, Kyle Hutson <kylehutson@xxxxxxx> wrote:
> Thank you, John!
>
> That was exactly the bug we were hitting. My Google-fu didn't lead me to
> this one.

Here is the bug report: http://tracker.ceph.com/issues/10449. It's a
kernel client bug that causes the session map to grow without bound.
Which version of the Linux kernel are you using?
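
If you have many clients, something like the following may help collect
the answer from each of them. It is only a rough sketch: the function
name parse_kernel_release is made up for illustration, the sample release
string in the comment is just an example of the format, and the script
reads the running kernel release only, nothing Ceph-specific.

```python
import platform
import re


def parse_kernel_release(release):
    """Split a kernel release string into (major, minor, patch).

    Hypothetical helper for illustration. Accepts strings shaped like
    '3.10.0-229.el7.x86_64' and ignores everything after the numeric part.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    major, minor, patch = match.groups()
    # A missing patch level (e.g. '4.0') is treated as 0.
    return int(major), int(minor), int(patch or 0)


if __name__ == "__main__":
    # platform.release() reports the same string as 'uname -r' on Linux.
    release = platform.release()
    print(release, parse_kernel_release(release))
```

Run it on each machine that mounts the filesystem with the kernel client
and compare the result against the kernel changes mentioned in the ticket.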

Regards
Yan, Zheng

>
> On Wed, Apr 15, 2015 at 4:16 PM, John Spray <john.spray@xxxxxxxxxx> wrote:
>>
>> On 15/04/2015 20:02, Kyle Hutson wrote:
>>>
>>> I upgraded to 0.94.1 from 0.94 on Monday, and everything had been going
>>> pretty well.
>>>
>>> Then, about noon today, we had an mds crash. And then the failover mds
>>> crashed. And this cascaded through all 4 mds servers we have.
>>>
>>> If I try to start it ('service ceph start mds' on CentOS 7.1), it appears
>>> to be OK for a little while. ceph -w goes through 'replay' 'reconnect'
>>> 'rejoin' 'clientreplay' and 'active' but nearly immediately after getting to
>>> 'active', it crashes again.
>>>
>>> I have the mds log at
>>> http://people.beocat.cis.ksu.edu/~kylehutson/ceph-mds.hobbit01.log
>>>
>>> Some possibly, but not necessarily, useful background info:
>>> - Yesterday we took our erasure coded pool and increased both pg_num and
>>> pgp_num from 2048 to 4096. We still have several objects misplaced (~17%),
>>> but those seem to be continuing to clean themselves up.
>>> - We are in the midst of a large (300+ TB) rsync from our old (non-ceph)
>>> filesystem to this filesystem.
>>> - Before we noticed the mds crashes, we had just changed the size of our
>>> metadata pool from 2 to 4.
>>
>>
>> It looks like you're seeing http://tracker.ceph.com/issues/10449, which is
>> a situation where the SessionMap object becomes too big for the MDS to
>> save. The cause of it in that case was stuck requests from a misbehaving
>> client running a slightly older kernel.
>>
>> Assuming you're using the kernel client and having a similar problem, you
>> could try to work around this situation by forcibly unmounting the clients
>> while the MDS is offline, such that during clientreplay the MDS will remove
>> them from the SessionMap after timing out, and then next time it tries to
>> save the map it won't be oversized.  If that works, you could then look into
>> getting newer kernels on the clients to avoid hitting the issue again -- the
>> #10449 ticket has some pointers about which kernel changes were relevant.
>>
>> Cheers,
>> John
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>