> It's mds_beacon_grace. Set that on the monitor to control the replacement of
> laggy MDS daemons.

Sounds like William's issue is something else. William shuts down MDS 2 and
MON 4 simultaneously. The log shows that some time later (we don't know how
long), MON 3 detects that MDS 2 is gone ("MDS_ALL_DOWN"), but does nothing
about it until 30 seconds later, which happens to be when MDS 2 and MON 4
come back. At that point, MON 3 reports that the rank has been reassigned
to MDS 1.

'mds_beacon_grace' determines when a monitor declares MDS_ALL_DOWN, right?
If things are working as designed, I think the log should show MON 3
reassigning the rank to MDS 1 immediately after it reports that MDS 2 is
gone.
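In case it's useful to anyone following along, here's a minimal sketch of
checking and raising that option on the monitors. The 60-second value is
only illustrative (not a recommendation), and the mon ID is borrowed from
William's log:

    # Show the current grace via a monitor's admin socket
    # (run on that mon's host):
    ceph daemon mon.dub-sitv-ceph-03 config get mds_beacon_grace

    # Raise it at runtime on all monitors (takes effect immediately,
    # but does not survive a daemon restart):
    ceph tell mon.* injectargs '--mds_beacon_grace=60'

    # To make it persistent, set it in ceph.conf on the mon hosts:
    [mon]
    mds_beacon_grace = 60

If you're on Mimic or later, 'ceph config set mon mds_beacon_grace 60'
stores it in the monitors' central config database instead of ceph.conf.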
From the original post:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

--
Bryan Henderson                                     San Jose, California