Re: MDS does not always failover to hot standby on reboot

William Lawton <william.lawton@xxxxxxxxxx> · Thu, 30 Aug 2018 19:46:14 +0000

Oh, I see. We’d taken steps to reduce the risk of losing the active MDS and the mon leader at the same time, in the hope that it would prevent this issue. Do you know if the MDS always connects to a specific mon instance, i.e. the mon provider, and can it be determined which mon instance that is? Or is it ad hoc?

On 30 Aug 2018, at 20:01, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

Okay, well that will be the same reason then. If the active MDS is connected to a monitor and they fail at the same time, the monitors can’t replace the MDS until they’ve been through their own election and a full MDS timeout window.

On Thu, Aug 30, 2018 at 11:46 AM William Lawton <william.lawton@xxxxxxxxxx> wrote:

Thanks for the response, Greg. We did originally have co-located MDS and mon daemons but realised early on that this wasn't a good idea and separated them onto different hosts. So our MDS hosts are ceph-01 and ceph-02, and our mon hosts are ceph-03, 04 and 05. Unfortunately we see this issue occurring when we reboot ceph-02 (MDS) and ceph-04 (mon) together. We expect ceph-01 to become the active MDS but often it doesn't.

On 30 Aug 2018, at 17:46, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

Yes, this is a consequence of co-locating the MDS and monitors: if the MDS reports to its co-located monitor and both fail, the monitor cluster has to go through its own failure detection and then wait for a full MDS timeout period after that before it marks the MDS down. :(

We might conceivably be able to optimize for this, but there's not a general solution. If you need to co-locate, one thing that would make it better without being a lot of work is trying to have the MDS connect to one of the monitors on a different host. You can do that by just restricting the list of monitors you feed it in the ceph.conf, although it's not a guarantee that will *prevent* it from connecting to its own monitor if there are failures or reconnects after first startup.
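
[Editor's sketch] One way to picture the suggestion above: generate a ceph.conf fragment for an MDS host that lists only the monitors on other hosts. This is a rough illustration, not a tested recipe. The helper name `mds_conf_fragment` is hypothetical, placing `mon host` in an `[mds]` section of the MDS host's local ceph.conf is an assumption about how the restricted list would be fed to the daemon, and the co-location of an MDS with the monitor on dub-sitv-ceph-04 is invented purely for the example. The monitor addresses are the ones that appear in the ceph.log excerpts quoted in this thread.

```python
# Monitor hosts and addresses as they appear in the quoted ceph.log excerpts.
MONS = {
    "dub-sitv-ceph-03": "10.18.53.32",
    "dub-sitv-ceph-04": "10.18.53.155",
    "dub-sitv-ceph-05": "10.18.186.208",
}


def mds_conf_fragment(mons, mds_host):
    """Build a ceph.conf fragment listing only monitors NOT on mds_host.

    Hypothetical helper. Assumption: a 'mon host' entry under [mds] in the
    MDS host's ceph.conf controls which monitors the MDS contacts at startup.
    """
    remote = [addr for host, addr in mons.items() if host != mds_host]
    if not remote:
        raise ValueError("no monitor left on a different host")
    return "[mds]\nmon host = " + ", ".join(remote) + "\n"


# Hypothetical case: an MDS sharing a host with the monitor on ceph-04.
fragment = mds_conf_fragment(MONS, "dub-sitv-ceph-04")
print(fragment)
```

As noted in the reply, this only influences the initial connection; the MDS may still end up on its local monitor after failures or reconnects.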
-Greg

On Thu, Aug 30, 2018 at 8:38 AM William Lawton <william.lawton@xxxxxxxxxx> wrote:

Hi.

We have a 5 node Ceph cluster (refer to the ceph -s output at the bottom of this email). During resiliency tests we have an occasional problem when we reboot the active MDS instance and a MON instance together, i.e. dub-sitv-ceph-02 and dub-sitv-ceph-04. We expect the MDS to fail over to the standby instance dub-sitv-ceph-01, which is in standby-replay mode, and 80% of the time it does with no problems. However, 20% of the time it doesn’t, and the MDS_ALL_DOWN health check is not cleared until 30 seconds later, when the rebooted dub-sitv-ceph-02 and dub-sitv-ceph-04 instances come back up.

When the MDS successfully fails over to the standby we see in the ceph.log the following:

2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 00:30:02.237389 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 52 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

When the active MDS role does not fail over to the standby, the MDS_ALL_DOWN check is not cleared until after the rebooted instances have come back up, e.g.:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
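
[Editor's sketch] To compare the two cases numerically, the time between the first "Health check failed ... (MDS_ALL_DOWN)" record and the next "Health check cleared: MDS_ALL_DOWN" record can be computed from ceph.log. The snippet below is a minimal illustration, assuming each log record sits on a single line as it does in the original ceph.log file; the function names are hypothetical, and the sample records are copied from the excerpts above.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_ts(line):
    # The first 26 characters of a ceph.log record hold the timestamp,
    # e.g. "2018-08-25 03:30:02.936554".
    return datetime.strptime(line[:26], TS_FORMAT)


def mds_all_down_outages(lines):
    """Return (start, end, seconds) for each MDS_ALL_DOWN failed -> cleared span.

    A repeated 'failed' record before the matching 'cleared' record does not
    restart the clock; the span is measured from the first failure.
    """
    outages = []
    start = None
    for line in lines:
        if "Health check failed" in line and "(MDS_ALL_DOWN)" in line:
            if start is None:
                start = parse_ts(line)
        elif "Health check cleared: MDS_ALL_DOWN" in line and start is not None:
            end = parse_ts(line)
            outages.append((start, end, (end - start).total_seconds()))
            start = None
    return outages


SAMPLE = [
    "2018-08-25 00:30:02.231811 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 50 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    "2018-08-25 00:30:02.237528 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 54 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)",
    "2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    "2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)",
    "2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)",
]

outages = mds_all_down_outages(SAMPLE)
for start, end, seconds in outages:
    print(f"{start} -> {end}: {seconds:.3f} s")
```

On these records the successful failover clears within a few milliseconds, while the failed one stays down for roughly 32 seconds, which matches the "30 seconds later" observation.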

In the MDS log we’ve noticed that when the issue occurs, at precisely the time when the active MDS/MON nodes are rebooted, the standby MDS instance briefly stops logging replay_done (as standby). This is shown in the log excerpt below, where there is a 9s gap in these messages.

2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)
2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)
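
[Editor's sketch] A gap like this can be spotted automatically by comparing consecutive replay_done timestamps. The snippet below is only an illustration: the function name `replay_gaps` and the 2 second threshold are assumptions chosen for the example (the excerpt shows the message roughly once per second), and the sample lines are the four MDS log lines quoted above.

```python
from datetime import datetime


def replay_gaps(lines, threshold=2.0):
    """Return (previous, current, seconds) for consecutive replay_done
    messages that are more than `threshold` seconds apart."""
    stamps = [
        # The first 23 characters hold the timestamp, e.g. "2018-08-25 03:30:00.085".
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")
        for line in lines
        if "replay_done (as standby)" in line
    ]
    gaps = []
    for prev, cur in zip(stamps, stamps[1:]):
        delta = (cur - prev).total_seconds()
        if delta > threshold:
            gaps.append((prev, cur, delta))
    return gaps


MDS_LOG = [
    "2018-08-25 03:30:00.085 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:01.091 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:10.332 7f3ab9b00700  1 mds.0.0 replay_done (as standby)",
    "2018-08-25 03:30:11.333 7f3abb303700  1 mds.0.0 replay_done (as standby)",
]

gaps = replay_gaps(MDS_LOG)
for prev, cur, delta in gaps:
    print(f"gap of {delta:.3f} s between {prev.time()} and {cur.time()}")
```

Run against the excerpt, this reports a single gap of about 9.2 seconds between 03:30:01.091 and 03:30:10.332.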

I’ve tried to reproduce the issue by rebooting each MDS instance in turn repeatedly 5 minutes apart but so far haven’t been able to do so, so my assumption is that rebooting the MDS and a MON instance at the same time is a significant factor.

Our mds_standby* configuration is set as follows:

    "mon_force_standby_active": "true",
    "mds_standby_for_fscid": "-1",
    "mds_standby_for_name": "",
    "mds_standby_for_rank": "0",
    "mds_standby_replay": "true",

The cluster status is as follows:

  cluster:
    id:     f774b9b2-d514-40d9-85ab-d0389724b6c0
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05
    mgr: dub-sitv-ceph-04(active), standbys: dub-sitv-ceph-03, dub-sitv-ceph-05
    mds: cephfs-1/1/1 up  {0=dub-sitv-ceph-02=up:active}, 1 up:standby-replay
    osd: 4 osds: 4 up, 4 in

  data:
    pools:   2 pools, 200 pgs
    objects: 554  objects, 980 MiB
    usage:   7.9 GiB used, 1.9 TiB / 2.0 TiB avail
    pgs:     200 active+clean

  io:
    client:   1.5 MiB/s rd, 810 KiB/s wr, 286 op/s rd, 218 op/s wr

Hope someone can help!
William Lawton

_______________________________________________

ceph-users mailing list

ceph-users@xxxxxxxxxxxxxx

http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
