On Tue, Dec 15, 2015 at 5:01 PM, Bryan Wright <bkw1a@xxxxxxxxxxxx> wrote:
> Hi folks,
>
> This morning, one of my MDSes dropped into "replaying":
>
> mds cluster is degraded
> mds.0 at 192.168.1.31:6800/12550 rank 0 is replaying journal
>
> and the ceph filesystem seems to be unavailable to the clients. Is there
> any way to see the progress of this replay? I don't see any indication in
> the logs or elsewhere that it's actually doing anything. If it's safe to
> truncate the journal, I'd be fine with just losing any changes made since
> this morning, in order to get the filesystem back online.

While you may see people dropping journals from time to time in disaster
situations, be aware that it is not as simple as losing recent changes.
Dropping the journal can easily leave you with an inconsistent filesystem
(e.g. you appended to files, but the updates to their size metadata are
lost). I'm mainly mentioning this for the benefit of the list archive, as
the topic of resetting journals comes up fairly often.

Anyway -- you'll need to do some local poking of the MDS to work out what
the hold-up is. Turn up MDS debug logging [1] and see what it's saying
during the replay. You can also watch the performance counters ("ceph
daemon mds.<id> perf dump") and see which ones are incrementing to get an
idea of what it's doing: the "rd_pos" value from "perf dump mds_log"
should increment during replay. A rough example of checking this from the
shell is included below the footnote.

If you haven't already, also check the overall health of the MDS host,
e.g. is it low on memory or swapping?

John

1. http://docs.ceph.com/docs/master/rados/configuration/ceph-conf/#runtime-changes
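As a rough sketch (assuming the daemon is named "mds.a", that you are on
the MDS host with access to its admin socket, and that python is
available for pretty-printing; substitute your own daemon name):

    # Raise MDS debug logging at runtime via the admin socket
    # (hypothetical daemon name "mds.a"):
    ceph daemon mds.a config set debug_mds 10

    # Poll the journal position counters for a minute or so; the read
    # position should keep climbing while replay is actually making
    # progress. python -m json.tool pretty-prints the JSON so grep can
    # see one counter per line.
    for i in $(seq 1 30); do
        ceph daemon mds.a perf dump mds_log | python -m json.tool | grep pos
        sleep 2
    done

    # Quick sanity check that the host itself isn't memory-starved or
    # swapping:
    free -m
    vmstat 1 5

Grepping for "pos" rather than one exact counter name is deliberate,
since the spelling of the journal position counters can differ between
releases.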