Re: MDS does not always failover to hot standby on reboot

John Spray <jspray@xxxxxxxxxx> · Tue, 4 Sep 2018 11:16:03 +0100

It's mds_beacon_grace.  Set it on the monitors to control how long
they wait before replacing a laggy MDS daemon.  It's usually worth
setting it to the same value on the MDS daemons too, because the MDS
uses it to hold off on certain tasks if it hasn't seen a mon beacon
recently.
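
As a rough sketch of what that could look like in ceph.conf (15
seconds is, as far as I know, the default; the value and the section
layout here are illustrative assumptions, not taken from your
cluster):

```ini
[mon]
# How long the monitors tolerate missing MDS beacons before they
# treat the daemon as laggy and hand its rank to a standby.
mds beacon grace = 15

[mds]
# Keep the MDS-side value in step with the monitors.
mds beacon grace = 15
```

A lower value should shorten the failover delay, at the cost of
making it more likely that a briefly stalled MDS gets replaced.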

John
On Mon, Sep 3, 2018 at 9:26 AM William Lawton <william.lawton@xxxxxxxxxx> wrote:
>
> Which configuration option determines the MDS timeout period?
>
>
>
> William Lawton
>
>
>
> From: Gregory Farnum <gfarnum@xxxxxxxxxx>
> Sent: Thursday, August 30, 2018 5:46 PM
> To: William Lawton <william.lawton@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  MDS does not always failover to hot standby on reboot
>
>
>
> Yes, this is a consequence of co-locating the MDS and monitors — if the MDS reports to its co-located monitor and both fail, the monitor cluster has to go through its own failure detection and then wait for a full MDS timeout period after that before it marks the MDS down. :(
>
>
>
> We might conceivably be able to optimize for this, but there's not a general solution. If you need to co-locate, one thing that would make it better without being a lot of work is trying to have the MDS connect to one of the monitors on a different host. You can do that by just restricting the list of monitors you feed it in the ceph.conf, although it's not a guarantee that will *prevent* it from connecting to its own monitor if there are failures or reconnects after first startup.
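>
> A hedged sketch of that restriction, as it might look in the ceph.conf on dub-sitv-ceph-02: list only the monitors that are not rebooted together with that MDS. The two addresses are those of mon.dub-sitv-ceph-03 and mon.dub-sitv-ceph-05 in the logs further down; scoping the option to the [mds] section is an assumption about your config layout, and it still does not guarantee which monitor the MDS ends up talking to after a reconnect.
>
> ```ini
> [mds]
> # Assumption: restrict this host's MDS to the monitors that stay up
> # during the test (dub-sitv-ceph-03 and dub-sitv-ceph-05).
> mon host = 10.18.53.32,10.18.186.208
> ```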
>
> -Greg
>
> On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.lawton@xxxxxxxxxx> wrote:
>
> Hi.
>
>
>
> We have a 5-node Ceph cluster (see the ceph -s output at the bottom of this email). During resiliency tests we occasionally hit a problem when we reboot the active MDS instance and a MON instance together, i.e. dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail over to the standby instance dub-sitv-ceph-01, which is in standby-replay mode, and 80% of the time it does so with no problems. However, 20% of the time it doesn’t, and the MDS_ALL_DOWN health check is not cleared until about 30 seconds later, when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.
>
>
>
> When the MDS successfully fails over to the standby, we see the following in ceph.log:
>
>
>
> 2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
> 2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>
>
>
> When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN check is not cleared until after the rebooted instances have come back up, e.g.:
>
>
>
> 2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
> 2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
> 2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
> 2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
> 2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
> 2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
> 2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
> 2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
> 2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
> 2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
> 2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
> 2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
> 2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
> 2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
> 2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
> 2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
> 2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
> 2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
> 2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
>
>
>
> In the MDS log we’ve noticed that when the issue occurs, at precisely the time the active MDS/MON nodes are rebooted, the standby MDS instance briefly stops logging replay_done (as standby). This is shown in the log excerpt below, where there is a 9s gap between entries.
>
>
>
> 2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
> 2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
>
>
>
> I’ve tried to reproduce the issue by rebooting each MDS instance in turn repeatedly 5 minutes apart but so far haven’t been able to do so, so my assumption is that rebooting the MDS and a MON instance at the same time is a significant factor.
>
>
>
> Our mds_standby* configuration is set as follows:
>
>     "mon_force_standby_active": "true",
>     "mds_standby_for_fscid": "-1",
>     "mds_standby_for_name": "",
>     "mds_standby_for_rank": "0",
>     "mds_standby_replay": "true",
>
>
>
> The cluster status is as follows:
>
>
>
>   cluster:
>     id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
>     mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, dub-sitv-ceph-05
>     mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
>     osd: 4 osds: 4 up, 4 in
>
>   data:
>     pools:   2 pools, 200 pgs
>     objects: 554  objects, 980 MiB
>     usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
>     pgs:     200 active+clean
>
>   io:
>     client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr
>
>
>
> Hope someone can help!
>
> William Lawton
>
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>