mds stuck in replay state

Hi List,

Still rsyncing the same data as in the last ticket. However, the mds is for
some reason stuck in the "replay" state. I've tried restarting the mds
process to get it to fail over to another node, but regardless of which node
is the active mds, it is still in replay. I'm not sure how to diagnose this
further. This is what I see in the logs:

2012-09-20 09:34:46.366127 7f749fade780  0 ceph version 0.51 (commit:c03ca95d235c9a072dcd8a77ad5274a52e93ae30), process ceph-mds, pid 11115
2012-09-20 09:34:46.368248 7f749aaec700  0 mds.-1.0 ms_handle_connect on 10.87.1.104:6789/0
2012-09-20 09:34:46.505150 7f749aaec700  1 mds.-1.0 handle_mds_map standby
2012-09-20 09:38:57.987721 7f749aaec700  1 mds.0.14 handle_mds_map i am now mds.0.14
2012-09-20 09:38:57.987724 7f749aaec700  1 mds.0.14 handle_mds_map state change up:standby --> up:replay
2012-09-20 09:38:57.987727 7f749aaec700  1 mds.0.14 replay_start
2012-09-20 09:38:57.987736 7f749aaec700  1 mds.0.14  recovery set is
2012-09-20 09:38:57.987741 7f749aaec700  1 mds.0.14  need osdmap epoch 356, have 310
2012-09-20 09:38:57.987743 7f749aaec700  1 mds.0.14  waiting for osdmap 356 (which blacklists prior instance)
2012-09-20 09:38:57.987783 7f749aaec700  1 mds.0.cache handle_mds_failure mds.0 : recovery peers are
2012-09-20 09:38:58.282446 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.104:6852/32172
2012-09-20 09:38:58.282495 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.96:6809/12082
2012-09-20 09:38:58.282562 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.103:6854/24256
2012-09-20 09:38:58.282661 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.93:6800/2661
2012-09-20 09:38:58.284226 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.95:6812/17871
2012-09-20 09:38:58.284258 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.90:6815/8616
2012-09-20 09:38:58.304331 7f749aaec700  0 mds.0.14 ms_handle_connect on 10.87.1.100:6848/16139
2012-09-20 09:38:58.314442 7f749aaec700  0 mds.0.cache creating system inode with ino:100
2012-09-20 09:38:58.314695 7f749aaec700  0 mds.0.cache creating system inode with ino:1
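
The "need osdmap epoch 356, have 310" line looks like the mds is blocked waiting
for the monitors to deliver the newer osdmap (the one that blacklists the previous
mds instance). In case it helps, these are the sort of commands I could run to
compare the epochs and look for that blacklist entry (just a sketch of what I'd
try, I haven't captured the output here):

# current osdmap epoch according to the monitors
ceph osd stat

# full osdmap, to see the epoch line and any blacklist entries for the old mds instance
ceph osd dump | grep -i -E 'epoch|blacklist'

# which mds holds rank 0 and what state the mdsmap reports for it
ceph mds stat
ceph mds dump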

The cluster is currently I/O-locked. It looks like the mds isn't that stable
yet; I haven't yet had a single failover between mds daemons go smoothly.
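
In case it's useful, this is roughly what I can gather on the next failover
attempt (again only a sketch; exact option names may differ on 0.51):

# watch cluster state changes while the standby takes over
ceph -w

# confirm overall health and which mds is active afterwards
ceph health
ceph mds stat

# in ceph.conf, under [mds], for more verbose replay logging before restarting:
#   debug mds = 20
#   debug ms = 1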

Thanks in advance for your help!

t.