On Tue, Dec 15, 2015 at 5:01 PM, Bryan Wright <bkw1a@xxxxxxxxxxxx> wrote:
> Hi folks,
>
> This morning, one of my MDSes dropped into "replaying":
>
> mds cluster is degraded
> mds.0 at 192.168.1.31:6800/12550 rank 0 is replaying journal
>
> and the ceph filesystem seems to be unavailable to the clients. Is there
> any way to see the progress of this replay? I don't see any indication in
> the logs or elsewhere that it's actually doing anything. If it's safe to
> truncate the journal, I'd be fine with just losing any changes made since
> this morning, in order to get the filesystem back online.

While you may see people dropping journals from time to time in disaster
situations, be aware that it is not as simple as losing recent changes.
Dropping the journal can easily leave you with an inconsistent filesystem
(e.g. you appended to files, but the updates to their size metadata are
lost). I'm mainly mentioning this for the benefit of the list archive, as
the topic of resetting journals comes up fairly often.

Anyway -- you'll need to do some local poking of the MDS to work out what
the hold-up is. Turn up MDS debug logging [1] and see what it's saying
during the replay. You can also watch the performance counters ("ceph
daemon mds.<id> perf dump") and see which ones are incrementing to get an
idea of what it's doing: the "rd_pos" value from "perf dump mds_log"
should increment during replay. A rough example of checking this from the
shell is included below the footnote.

If you haven't already, also check the overall health of the MDS host,
e.g. is it low on memory or swapping?

John

1. http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/#runtime-changes
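As a rough sketch (assuming the daemon is named "mds.a", that you are on
the MDS host with access to its admin socket, and that python is
available for pretty-printing; substitute your own daemon name):

    # Raise MDS debug logging at runtime via the admin socket
    # (hypothetical daemon name "mds.a"):
    ceph daemon mds.a config set debug_mds 10

    # Poll the journal position counters for a minute or so; the read
    # position should keep climbing while replay is actually making
    # progress. python -m json.tool pretty-prints the JSON so grep can
    # see one counter per line.
    for i in $(seq 1 30); do
        ceph daemon mds.a perf dump mds_log | python -m json.tool | grep pos
        sleep 2
    done

    # Quick sanity check that the host itself isn't memory-starved or
    # swapping:
    free -m
    vmstat 1 5

Grepping for "pos" rather than one exact counter name is deliberate,
since the spelling of the journal position counters can differ between
releases.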