mds stuck in clientreplay state after failover

Dong Yuan <yuandong1222@xxxxxxxxx> · Wed, 9 Oct 2013 13:21:51 +0800

I think I have found a bug in the clientreplay handling of the mds
(different from #4742).

After a failover, the standby mds begins to recover, enters the
clientreplay state, and never moves on to the next state (active).

I took a core dump of the mds process with gcore and examined it in gdb.
The main thread (dispatch thread) is idle and mdcache->active_request is
empty, but mds->replay_queue still has one element, which is strange.

From the code, replay_queue holds all the requests that need to be
replayed. When the mds enters the clientreplay state,
MDS::queue_one_replay is called to pick a request from the
replay_queue and put it into finished_queue, which starts the replay.

After the first replay request has finished, MDS::queue_one_replay
should be called again to deal with the next replay request. There are
three paths that do this:
1) Server::journal_and_reply
2) MDCache::request_cleanup
3) Server::handle_client_request

But it seems that none of these paths called MDS::queue_one_replay. As
a result, the mds is stuck in the clientreplay state.

Maybe some request-processing path never goes through any of the above
three methods. But I can't find the previous request; it seems to have
been completed and cleaned up from the MDCache.

Does anyone have any idea about this problem?

I can give more details if needed. I have the core dump but it is too
big (300MB+) to upload.

Thanks for any help.

-- 
Dong Yuan
Email:yuandong1222@xxxxxxxxx