> It's mds_beacon_grace. Set that on the monitor to control the replacement of
> laggy MDS daemons.

Sounds like William's issue is something else. William shuts down MDS 2 and
MON 4 simultaneously. The log shows that some time later (we don't know how
long), MON 3 detects that MDS 2 is gone ("MDS_ALL_DOWN"), but does nothing
about it until 30 seconds later, which happens to be when MDS 2 and MON 4
come back. At that point, MON 3 reports that the rank has been reassigned
to MDS 1.

'mds_beacon_grace' determines when a monitor declares MDS_ALL_DOWN, right?
If things are working as designed, I think the log should show MON 3
reassigning the rank to MDS 1 immediately after it reports that MDS 2 is
gone.
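In case it's useful to anyone following along, here's a minimal sketch of
checking and raising that option on the monitors. The 60-second value is
only illustrative (not a recommendation), and the mon ID is borrowed from
William's log:

    # Show the current grace via a monitor's admin socket
    # (run on that mon's host):
    ceph daemon mon.dub-sitv-ceph-03 config get mds_beacon_grace

    # Raise it at runtime on all monitors (takes effect immediately,
    # but does not survive a daemon restart):
    ceph tell mon.* injectargs '--mds_beacon_grace=60'

    # To make it persistent, set it in ceph.conf on the mon hosts:
    [mon]
    mds_beacon_grace = 60

If you're on Mimic or later, 'ceph config set mon mds_beacon_grace 60'
stores it in the monitors' central config database instead of ceph.conf.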
From the original post:

2018-08-25 03:30:02.936554 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 55 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:04.235703 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 226 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:04.238672 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 56 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:09.242595 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 57 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-05 in quorum (ranks 0,2)
2018-08-25 03:30:09.252804 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 62 : cluster [WRN] Health check failed: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05 (MON_DOWN)
2018-08-25 03:30:09.258693 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 63 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05
2018-08-25 03:30:10.254162 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 64 : cluster [WRN] Health check failed: Reduced data availability: 2 pgs inactive, 115 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:12.429145 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 66 : cluster [WRN] Health check failed: Degraded data redundancy: 712/2504 objects degraded (28.435%), 86 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:16.137408 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 67 : cluster [WRN] Health check update: Reduced data availability: 1 pg inactive, 69 pgs peering (PG_AVAILABILITY)
2018-08-25 03:30:17.193322 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 68 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 69 pgs peering)
2018-08-25 03:30:18.432043 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 69 : cluster [WRN] Health check update: Degraded data redundancy: 1286/2572 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:26.139491 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 71 : cluster [WRN] Health check update: Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded (PG_DEGRADED)
2018-08-25 03:30:31.355321 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 1 : cluster [INF] mon.dub-sitv-ceph-04 calling monitor election
2018-08-25 03:30:31.371519 mon.dub-sitv-ceph-04 mon.1 10.18.53.155:6789/0 2 : cluster [WRN] message from mon.0 was stamped 0.817433s in the future, clocks not synchronized
2018-08-25 03:30:32.175677 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 72 : cluster [INF] mon.dub-sitv-ceph-03 calling monitor election
2018-08-25 03:30:32.175864 mon.dub-sitv-ceph-05 mon.2 10.18.186.208:6789/0 227 : cluster [INF] mon.dub-sitv-ceph-05 calling monitor election
2018-08-25 03:30:32.180615 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 73 : cluster [INF] mon.dub-sitv-ceph-03 is new leader, mons dub-sitv-ceph-03,dub-sitv-ceph-04,dub-sitv-ceph-05 in quorum (ranks 0,1,2)
2018-08-25 03:30:32.189593 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 78 : cluster [INF] Health check cleared: MON_DOWN (was: 1/3 mons down, quorum dub-sitv-ceph-03,dub-sitv-ceph-05)
2018-08-25 03:30:32.190820 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 79 : cluster [WRN] mon.1 10.18.53.155:6789/0 clock skew 0.811318s > max 0.05s
2018-08-25 03:30:32.194280 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 80 : cluster [WRN] overall HEALTH_WARN 2 osds down; 2 hosts (2 osds) down; Degraded data redundancy: 1292/2584 objects degraded (50.000%), 166 pgs degraded
2018-08-25 03:30:35.076121 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 83 : cluster [INF] daemon mds.dub-sitv-ceph-02 restarted
2018-08-25 03:30:35.270222 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 85 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2018-08-25 03:30:35.270267 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 86 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2018-08-25 03:30:35.282139 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 88 : cluster [INF] Standby daemon mds.dub-sitv-ceph-01 assigned to filesystem cephfs as rank 0
2018-08-25 03:30:35.282268 mon.dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

--
Bryan Henderson                                     San Jose, California