I knew I forgot to include something with my initial e-mail. Single
active with failover.

dumped mdsmap epoch 30608
epoch	30608
flags	0
created	2015-04-02 16:15:55.209894
modified	2015-05-22 11:39:15.992774
tableserver	0
root	0
session_timeout	60
session_autoclose	300
max_file_size	17592186044416
last_failure	30606
last_failure_osd_epoch	24298
compat	compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table}
max_mds	1
in	0
up	{0=20284976}
failed
stopped
data_pools	25
metadata_pool	27
inline_data	disabled
20285024:	10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
20346784:	10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
20284976:	10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1

--
Adam

On Fri, May 22, 2015 at 11:37 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
> I've experienced MDS issues in the past, but nothing sticks out to me in your logs.
>
> Are you using a single active MDS with failover, or multiple active MDS?
>
> --Lincoln
>
> On May 22, 2015, at 10:10 AM, Adam Tygart wrote:
>
>> Thanks for the quick response.
>>
>> I had 'debug mds = 20' in the first log, I added 'debug ms = 1' for this one:
>> https://drive.google.com/file/d/0B4XF1RWjuGh5bXFnRzE1SHF6blE/view?usp=sharing
>>
>> Based on these logs, it looks like heartbeat_map is_healthy 'MDS' just
>> times out and then the mds gets respawned.
>>
>> --
>> Adam
>>
>> On Fri, May 22, 2015 at 9:42 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
>>> Hi Adam,
>>>
>>> You can get the MDS to spit out more debug information like so:
>>>
>>> # ceph mds tell 0 injectargs '--debug-mds 20 --debug-ms 1'
>>>
>>> At least then you can see where it's at when it crashes.
>>>
>>> --Lincoln
>>>
>>> On May 22, 2015, at 9:33 AM, Adam Tygart wrote:
>>>
>>>> Hello all,
>>>>
>>>> The ceph-mds servers in our cluster are stuck in a constant
>>>> boot->replay->crash cycle.
>>>>
>>>> I have enabled debug logging for the mds for a restart cycle on one of
>>>> the nodes[1].
>>>>
>>>> Kernel debug from the cephfs client during reconnection attempts:
>>>> [732586.352173] ceph: mdsc delayed_work
>>>> [732586.352178] ceph: check_delayed_caps
>>>> [732586.352182] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732586.352185] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732586.352189] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>> [732586.352192] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732586.352195] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732586.352198] ceph: mdsc delayed_work
>>>> [732586.352200] ceph: check_delayed_caps
>>>> [732586.352202] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732586.352205] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732586.352207] ceph: send_renew_caps ignoring mds0 (up:replay)
>>>> [732586.352210] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732586.352212] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>> [732591.357123] ceph: mdsc delayed_work
>>>> [732591.357128] ceph: check_delayed_caps
>>>> [732591.357132] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732591.357135] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732591.357139] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732591.357142] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732591.357145] ceph: mdsc delayed_work
>>>> [732591.357147] ceph: check_delayed_caps
>>>> [732591.357149] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732591.357152] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732591.357154] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732591.357157] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>> [732596.362076] ceph: mdsc delayed_work
>>>> [732596.362081] ceph: check_delayed_caps
>>>> [732596.362084] ceph: lookup_mds_session ffff88202f01c000 210
>>>> [732596.362087] ceph: mdsc get_session ffff88202f01c000 210 -> 211
>>>> [732596.362091] ceph: add_cap_releases ffff88202f01c000 mds0 extra 680
>>>> [732596.362094] ceph: mdsc put_session ffff88202f01c000 211 -> 210
>>>> [732596.362097] ceph: mdsc delayed_work
>>>> [732596.362099] ceph: check_delayed_caps
>>>> [732596.362101] ceph: lookup_mds_session ffff881036cbf800 1
>>>> [732596.362104] ceph: mdsc get_session ffff881036cbf800 1 -> 2
>>>> [732596.362106] ceph: add_cap_releases ffff881036cbf800 mds0 extra 680
>>>> [732596.362109] ceph: mdsc put_session ffff881036cbf800 2 -> 1
>>>>
>>>> Does anybody have any debugging tips, or any ideas on how to get an mds stable?
>>>>
>>>> Server info: CentOS 7.1 with Ceph 0.94.1
>>>> Client info: Gentoo, kernel cephfs, 3.19.5-gentoo
>>>>
>>>> I'd reboot the client, but at this point I don't believe this is a
>>>> client issue.
>>>>
>>>> [1] https://drive.google.com/file/d/0B4XF1RWjuGh5WU1OZXpNb0Z1ck0/view?usp=sharing
>>>>
>>>> --
>>>> Adam
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
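[Archive note] The per-daemon lines at the end of the mdsmap dump in Adam's first message encode each MDS's state: two standbys and one daemon stuck in up:replay, which matches the reported boot->replay->crash loop. A minimal offline sketch of summarizing those states from a saved dump (the file name, heredoc, and grep pattern are my own illustration, not from the thread; on a live Hammer cluster the dump itself would come from `ceph mds dump`):

```shell
# Save the daemon lines from the pasted mdsmap dump to a scratch file
# (hypothetical file name; on a live cluster: ceph mds dump > mdsmap.txt).
cat > mdsmap.txt <<'EOF'
20285024: 10.5.38.2:7021/32024 'hobbit02' mds.-1.0 up:standby seq 1
20346784: 10.5.38.1:6957/223554 'hobbit01' mds.-1.0 up:standby seq 1
20284976: 10.5.38.13:6926/66700 'hobbit13' mds.0.1696 up:replay seq 1
EOF

# Count daemons per state: extracts every "up:<state>" token and tallies them.
grep -oE 'up:[a-z]+' mdsmap.txt | sort | uniq -c
# Expected: 1 up:replay, 2 up:standby
```

A healthy single-active-with-failover layout would show one up:active and the rest up:standby; here the would-be active never leaves up:replay.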