Re: MDS: How to increase timeouts?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Tue, 15 Dec 2015 13:22:45 -0800

On Tue, Dec 15, 2015 at 10:21 AM, Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> I have a setup with two MDS daemons in an active/standby configuration.
> During times of high network load or congestion, the active role bounces
> between the two instances:
>
> 1. The mons(?) decide that MDS A has crashed or is unavailable because of
> missing heartbeats
>
> 2015-12-15 16:38:08.471608 7f880df10700  1 mds.beacon.ceph-storage-01 _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:38:10.534941 7f8813e4b700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> ...
> 2015-12-15 16:38:15.468190 7f880e711700  1 heartbeat_map reset_timeout 'MDS'
> had timed out after 15
> 2015-12-15 16:38:17.734172 7f8811818700  1 mds.-1.-1 handle_mds_map i
> (192.168.6.129:6825/2846) dne in the mdsmap, respawning myself
>
> 2. Failover to standby MDS B
> 3. MDS B starts recover/rejoin (takes up to 15 minutes), introducing even
> more load
> 4. MDS A is respawned as new standby MDS
> 5. mons kick out MDS B after timeout
> 6. Failover to MDS A

It takes 15 minutes to work through rejoin on your MDS? :/ You might
try running your daemons in standby-replay instead of just standby, so
that they have a warm cache.
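
To put a number on the rejoin phase, here is a minimal Python sketch that
pulls timestamps out of MDS log lines and reports the time between two
events. It assumes the log format shown in the quoted output (a 26-character
timestamp at the start of each line, with the event name as the last word),
and the helper names are made up for illustration, not part of any Ceph tool.

```python
from datetime import datetime

# Sample lines copied from the quoted MDS log (unwrapped, quote markers removed).
SAMPLE_LOG = """\
2015-12-15 16:40:25.815388 7f7ca6946700  1 mds.0.80 rejoin_start
2015-12-15 16:41:01.339416 7f7ca6946700  1 mds.0.80 rejoin_joint_start
2015-12-15 16:51:39.648189 7f7ca283d700  1 mds.0.80 rejoin_done
"""


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp of a log line."""
    return datetime.strptime(line[:26], "%Y-%m-%d %H:%M:%S.%f")


def find_timestamp(log_text, marker):
    """Return the timestamp of the first line whose last word is `marker`."""
    for line in log_text.splitlines():
        # The leading space keeps 'rejoin_start' from matching 'rejoin_joint_start'.
        if line.rstrip().endswith(" " + marker):
            return parse_timestamp(line)
    raise ValueError("marker not found in log: " + marker)


def seconds_between(log_text, start_marker, end_marker):
    """Elapsed seconds from the start_marker line to the end_marker line."""
    start = find_timestamp(log_text, start_marker)
    end = find_timestamp(log_text, end_marker)
    return (end - start).total_seconds()


if __name__ == "__main__":
    elapsed = seconds_between(SAMPLE_LOG, "rejoin_start", "rejoin_done")
    print("rejoin took %.1f s" % elapsed)
```

For the sample lines this comes out at roughly 674 seconds, so a bit over 11
minutes from rejoin_start to rejoin_done. Running the same thing before and
after a change (standby-replay, for example) gives a simple comparison.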

You could also try to figure out whether the limiting factor is MDS
throughput or OSD IOPS.
-Greg

> ....
>
> I resolve this situation by shutting down one MDS completely, which forces
> the cluster to use the remaining one:
> 2015-12-15 16:40:20.618774 7f7ca6946700  1 mds.0.80 reconnect_done
> 2015-12-15 16:40:25.815374 7f7ca6946700  1 mds.0.80 handle_mds_map i am now
> mds.0.80
> 2015-12-15 16:40:25.815384 7f7ca6946700  1 mds.0.80 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2015-12-15 16:40:25.815388 7f7ca6946700  1 mds.0.80 rejoin_start
> 2015-12-15 16:40:43.159921 7f7ca894a700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:44.619252 7f7ca303e700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:44.619260 7f7ca303e700  1 mds.beacon.cb-dell-pe620r _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:40:48.160055 7f7ca894a700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:48.619315 7f7ca303e700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:48.619322 7f7ca303e700  1 mds.beacon.cb-dell-pe620r _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:40:52.619380 7f7ca303e700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:52.619405 7f7ca303e700  1 mds.beacon.cb-dell-pe620r _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:40:53.160157 7f7ca894a700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:56.619442 7f7ca303e700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:40:56.619452 7f7ca303e700  1 mds.beacon.cb-dell-pe620r _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:40:58.160257 7f7ca894a700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:41:00.619510 7f7ca303e700  1 heartbeat_map is_healthy 'MDS'
> had timed out after 15
> 2015-12-15 16:41:00.619519 7f7ca303e700  1 mds.beacon.cb-dell-pe620r _send
> skipping beacon, heartbeat map not healthy
> 2015-12-15 16:41:01.339416 7f7ca6946700  1 mds.0.80 rejoin_joint_start
> 2015-12-15 16:41:01.343018 7f7ca383f700  1 heartbeat_map reset_timeout 'MDS'
> had timed out after 15
> 2015-12-15 16:41:05.756154 7f7ca6946700  0 mds.beacon.cb-dell-pe620r
> handle_mds_beacon no longer laggy
> 2015-12-15 16:51:39.648189 7f7ca283d700  1 mds.0.80 rejoin_done
> 2015-12-15 16:51:40.932766 7f7ca6946700  1 mds.0.80 handle_mds_map i am now
> mds.0.80
> 2015-12-15 16:51:40.932783 7f7ca6946700  1 mds.0.80 handle_mds_map state
> change up:rejoin --> up:active
> 2015-12-15 16:51:40.932788 7f7ca6946700  1 mds.0.80 recovery_done --
> successful recovery!
> 2015-12-15 16:51:43.235230 7f7ca6946700  1 mds.0.80 active_start
> 2015-12-15 16:51:43.279305 7f7ca6946700  1 mds.0.80 cluster recovered.
> ...
>
> How do I prevent the mons from removing the active MDS from the mdsmap or
> allow a larger timeout? The documentation mentions mds_beacon_grace and
> mds_beacon_interval, but it is not clear how these correlate to the
> timeouts.
>
> How do I have to change the configuration to tolerate network congestion
> lasting up to 5 minutes?
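
As a rough illustration of how the two documented options relate:
mds_beacon_interval is how often the MDS sends a beacon to the mons (4
seconds by default), and mds_beacon_grace is how long the mons wait without
a beacon before treating the MDS as laggy and possibly replacing it (15
seconds by default, which would match the "timed out after 15" lines). The
Python sketch below is only arithmetic on those two values. The function
names and the two-beacon safety margin are assumptions for illustration,
and a larger grace also means a genuinely dead MDS takes that much longer
to be replaced.

```python
import math


def beacons_lost_before_laggy(beacon_interval, beacon_grace):
    """How many consecutive beacons can go missing before the grace expires."""
    return math.floor(beacon_grace / beacon_interval)


def grace_for_outage(outage_seconds, beacon_interval, margin_beacons=2):
    """Grace (seconds) needed to ride out an outage, plus a margin of a few
    beacon intervals. The margin is an arbitrary choice for this sketch."""
    return outage_seconds + margin_beacons * beacon_interval


if __name__ == "__main__":
    # Defaults: a beacon every 4 s, grace of 15 s.
    print(beacons_lost_before_laggy(4, 15))
    # Grace needed to tolerate 5 minutes (300 s) without beacons.
    print(grace_for_outage(300, 4))
```

With the defaults, three lost beacons in a row are enough to hit the grace;
tolerating 300 seconds of congestion would need a grace somewhat above 300.
Whatever value is chosen presumably has to be visible to the mons as well as
the MDS daemons, since the mons make the decision.
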
>
> Best regards,
> Burkhard
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com