I knew I forgot to include something with my initial e-mail. Single
active with failover.

dumped mdsmap epoch 30608
epoch	30608
flags	0
created	2015-04-02 16:15:55.209894
modified	2015-05-22 11:39:15.992774
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	17592186044416
last_failure	30606
last_failure_osd_epoch	24298
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds	1
in	0
up	{0=20284976}
failed
stopped
data_pools	25
metadata_pool	27
inline_data	disabled
20285024:	10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
20346784:	10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
20284976:	10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1

--
Adam

On Fri, May 22, 2015 at 11:37 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> I've experienced MDS issues in the past, but nothing sticks out to me in your logs.
>
> Are you using a single active MDS with failover, or multiple active MDS?
>
> --Lincoln
>
> On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
>
>> Thanks for the quick response.
>>
>> I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
>> https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
>>
>> Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
>> times out and then the mds gets respawned.
>>
>> --
>> Adam
>>
>> On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
>>> Hi Adam,
>>>
>>> You can get the MDS to spit out more debug information like so:
>>>
>>> # ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
>>>
>>> At least then you can see where it's at when it crashes.
>>>
>>> --Lincoln
>>>
>>> On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
>>>
>>>> Hello all,
>>>>
>>>> The ceph-mds servers in our cluster are stuck in a constant
>>>> boot->replay->crash cycle.
>>>>
>>>> I have enabled debug logging for the mds for a restart cycle on one of
>>>> the nodes[1].
>>>>
>>>> Kernel debug from the cephfs client during reconnection attempts:
>>>> [732586.352173] ceph: mdsc delayed_work
>>>> [732586.352178] ceph: check_delayed_caps
>>>> [732586.352182] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732586.352185] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732586.352189] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>> [732586.352192] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732586.352195] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732586.352198] ceph: mdsc delayed_work
>>>> [732586.352200] ceph: check_delayed_caps
>>>> [732586.352202] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732586.352205] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732586.352207] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>> [732586.352210] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732586.352212] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>> [732591.357123] ceph: mdsc delayed_work
>>>> [732591.357128] ceph: check_delayed_caps
>>>> [732591.357132] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732591.357135] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732591.357139] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732591.357142] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732591.357145] ceph: mdsc delayed_work
>>>> [732591.357147] ceph: check_delayed_caps
>>>> [732591.357149] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732591.357152] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732591.357154] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732591.357157] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>> [732596.362076] ceph: mdsc delayed_work
>>>> [732596.362081] ceph: check_delayed_caps
>>>> [732596.362084] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732596.362087] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732596.362091] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732596.362094] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732596.362097] ceph: mdsc delayed_work
>>>> [732596.362099] ceph: check_delayed_caps
>>>> [732596.362101] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732596.362104] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732596.362106] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732596.362109] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>>
>>>> Does anybody have any debugging tips, or any ideas on how to get an mds stable?
>>>>
>>>> Server info: CentOS 7.1 with Ceph 0.94.1
>>>> Client info: Gentoo, kernel cephfs, 3.19.5-gentoo
>>>>
>>>> I'd reboot the client, but at this point I don't believe this is a
>>>> client issue.
>>>>
>>>> [1] https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing
>>>>
>>>> --
>>>> Adam
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
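[Archive note] The per-daemon lines at the end of the mdsmap dump in Adam's first message encode each MDS's state: two standbys and one daemon stuck in up:replay, which matches the reported boot->replay->crash loop. A minimal offline sketch of summarizing those states from a saved dump (the file name, heredoc, and grep pattern are my own illustration, not from the thread; on a live Hammer cluster the dump itself would come from `ceph mds dump`):

```shell
# Save the daemon lines from the pasted mdsmap dump to a scratch file
# (hypothetical file name; on a live cluster: ceph mds dump > mdsmap.txt).
cat > mdsmap.txt <<'EOF'
20285024: 10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
20346784: 10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
20284976: 10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1
EOF

# Count daemons per state: extracts every "up:<state>" token and tallies them.
grep -oE 'up:[a-z]+' mdsmap.txt | sort | uniq -c
# Expected: 1 up:replay, 2 up:standby
```

A healthy single-active-with-failover layout would show one up:active and the rest up:standby; here the would-be active never leaves up:replay.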