Thanks John,
I got this in the mds log too:
2017-07-11 07:10:06.293219 7f1836837700 1 mds.beacon.b _send skipping beacon, heartbeat map not healthy
2017-07-11 07:10:08.330979 7f183b942700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
but that respawn happened 2 minutes after I got this:
2017-07-11 07:10:10.948237 7f183993e700 0 mds.beacon.b handle_mds_beacon no longer laggy
This confuses me. Could it be a network issue? Local network communication was fine at the time, so it might be a bug.
While it was recovering, it was stuck in the rejoin_joint_start state for almost 50 minutes.
2017-07-11 07:13:36.587188 7f264a112700 1 mds.0.890528 rejoin_joint_start
[...]
2017-07-11 07:56:21.521006 7f0f78917700 1 mds.0.890537 recovery_done -- successful recovery!
2017-07-11 07:56:21.522570 7f0f78917700 1 mds.0.890537 active_start
2017-07-11 07:56:21.533507 7f0f78917700 1 mds.0.890537 cluster recovered.
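As a rough check on the timing, the gap between the two log lines quoted above can be computed directly from their timestamps. This is only a sketch: the helper names parse_ts and elapsed are made up for illustration, and it measures from the quoted rejoin_joint_start line only, not from the first respawn.

```python
from datetime import datetime

def parse_ts(line):
    # The first two whitespace-separated fields of an MDS log line
    # are the date and the time (with microseconds).
    date, time_of_day = line.split()[:2]
    return datetime.strptime(f"{date} {time_of_day}", "%Y-%m-%d %H:%M:%S.%f")

def elapsed(start_line, end_line):
    # Time between two log lines, as a datetime.timedelta.
    return parse_ts(end_line) - parse_ts(start_line)

start = "2017-07-11 07:13:36.587188 7f264a112700 1 mds.0.890528 rejoin_joint_start"
end = "2017-07-11 07:56:21.521006 7f0f78917700 1 mds.0.890537 recovery_done -- successful recovery!"

print(elapsed(start, end))  # 0:42:44.933818
```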
I watched with "ceph daemon mds.b perf dump mds" and could see that it was scanning inodes. But when this happens (which is quite often), I have no way of telling when it will finish.
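One way to make that watching less manual is to poll the same command and print only the counters that changed between samples. The sketch below assumes the command prints JSON with the counters nested under an "mds" key; it does not assume any particular counter name, and the function names (perf_dump, counter_deltas, watch) are hypothetical. It shows which counters are still moving, not how much work remains.

```python
import json
import subprocess
import time

def perf_dump(daemon="mds.b", section="mds"):
    # Runs the command from the message above and parses its JSON output.
    # Assumption: the counters are nested under a key named after the section.
    out = subprocess.check_output(
        ["ceph", "daemon", daemon, "perf", "dump", section])
    return json.loads(out).get(section, {})

def counter_deltas(before, after):
    # Return {name: after - before} for plain numeric counters that changed.
    # Nested values (for example averaged counters) are skipped.
    deltas = {}
    for name, new in after.items():
        old = before.get(name)
        if (isinstance(new, (int, float)) and isinstance(old, (int, float))
                and new != old):
            deltas[name] = new - old
    return deltas

def watch(interval=10, daemon="mds.b"):
    # Print the changed counters every `interval` seconds until interrupted.
    prev = perf_dump(daemon)
    while True:
        time.sleep(interval)
        cur = perf_dump(daemon)
        print(time.strftime("%H:%M:%S"), counter_deltas(prev, cur))
        prev = cur
```

Calling watch() on the MDS host would then show, every 10 seconds, which counters are still advancing during rejoin.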
Many of the previous occurrences were caused by a crash (http://tracker.ceph.com/issues/20535), but that was not the case today.
Webert Lima
DevOps Engineer at MAV Tecnologia
Belo Horizonte - Brasil
On Tue, Jul 11, 2017 at 11:36 AM, John Spray <jspray@xxxxxxxxxx> wrote:
On Tue, Jul 11, 2017 at 3:23 PM, Webert de Souza Lima
<webert.boss@xxxxxxxxx> wrote:
> Hello,
>
> today I got an MDS respawn with the following message:
>
> 2017-07-11 07:07:55.397645 7ffb7a1d7700 1 mds.b handle_mds_map i
> (10.0.1.2:6822/28190) dne in the mdsmap, respawning myself
"dne in the mdsmap" is what an MDS says when the monitors have
concluded that the MDS is dead, but the MDS is really alive. "dne"
stands for "does not exist", so the MDS is complaining that it has
been removed from the mdsmap.
The message could definitely be better worded!
You can see this happen in certain buggy cases where the MDS is
failing to send beacon messages to the mons, even though it is really
alive -- if you're stuck in rejoin, then that is probably related: try
increasing the log verbosity to work out where the MDS is stuck while
it's sitting in the rejoin state.
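For illustration, one common way to raise MDS log verbosity at runtime is through the admin socket, with "ceph daemon mds.b config set debug_mds <level>". The sketch below only wraps that command; the daemon name mds.b is taken from the log lines in this thread, the level of 10 is an assumed example value, and the function names are hypothetical. Higher levels produce very large logs, so the setting should be lowered again afterwards.

```python
import subprocess

def debug_mds_cmd(level, daemon="mds.b"):
    # Build the admin-socket command that sets the MDS debug level.
    # Must be run on the host where the named daemon is running.
    return ["ceph", "daemon", daemon, "config", "set", "debug_mds", str(level)]

def set_debug_mds(level, daemon="mds.b"):
    # Apply the setting; raises CalledProcessError if the command fails.
    subprocess.check_call(debug_mds_cmd(level, daemon))

# Example (assumed level): set_debug_mds(10) while the MDS sits in rejoin,
# then restore the previous level once the relevant log lines are captured.
```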
John
>
> it happened 3 times within 5 minutes. After that, the MDS took 50 minutes to
> recover.
> I can't find what exactly that message means or how to avoid it.
>
> I'll be glad to provide any further information. Thanks!
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> Belo Horizonte - Brasil
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>