Hi!

From the documentation:

mds beacon grace
Description: The interval without beacons before Ceph declares an MDS laggy (and possibly replaces it).
Type: Float
Default: 15

I do not understand: is 15 in seconds or in beacons?

And an additional misunderstanding: if we gracefully shut down the MDS (or MON), why does it not inform everyone interested before it dies: "I am shutting down, no need to wait, appoint a new active server"?

----- Original Message -----
From: "David Turner" <drakonstein@xxxxxxxxx>
To: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
Cc: "Fyodor Ustinov" <ufm@xxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Tuesday, 19 February, 2019 20:57:49
Subject: Re: faster switch to another mds

It's also been mentioned a few times that when the MDS and MON are on the same host, the downtime for the MDS is longer when both daemons stop at about the same time. It's been suggested to stop the MDS daemon, wait for `ceph mds stat` to reflect the change, and then restart the rest of the server.

HTH.

On Mon, Feb 11, 2019 at 3:55 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:

> You can't tell from the client log here, but probably the MDS itself was
> failing over to a new instance during that interval. There's not much
> experience with it, but you could experiment with faster failover by
> reducing the mds beacon and grace times. This may or may not work
> reliably...
>
> On Sat, Feb 9, 2019 at 10:52 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>
>> Hi!
>>
>> I have a ceph cluster with 3 nodes running mon/mgr/mds servers.
>> I rebooted one node and saw this in the client log:
>>
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 socket closed (con state OPEN)
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon2 10.5.105.40:6789 session lost, hunting for new mon
>> Feb 09 20:29:14 ceph-nfs1 kernel: libceph: mon0 10.5.105.34:6789 session established
>> Feb 09 20:29:22 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket closed (con state OPEN)
>> Feb 09 20:29:23 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket closed (con state CONNECTING)
>> Feb 09 20:29:24 ceph-nfs1 kernel: libceph: mds0 10.5.105.40:6800 socket closed (con state CONNECTING)
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect start
>> Feb 09 20:29:53 ceph-nfs1 kernel: ceph: mds0 reconnect success
>> Feb 09 20:30:05 ceph-nfs1 kernel: ceph: mds0 recovery completed
>>
>> As I understand it, the following happened:
>> 1. The client detected that the link to the mon server was broken and quickly switched to another mon (in less than 1 second).
>> 2. The client detected that the link to the mds server was broken, tried to reconnect 3 times (unsuccessfully), waited, and reconnected to the same mds after 30 seconds of downtime.
>>
>> I have 2 questions:
>> 1. Why?
>> 2. How can I reduce the switching time to another mds?
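
If the value is indeed seconds, the experiment Greg suggests would look something like this in my case. This is a rough sketch only, with guessed values I have not tested; I put both options in [global] on the assumption that the monitors (which mark the MDS laggy) need to see the grace as well, and the daemons would need a restart to pick the change up:

    # ceph.conf -- example values only, not tested recommendations
    [global]
            mds_beacon_interval = 2    # MDS sends a beacon to the mons every 2 seconds (documented default: 4)
            mds_beacon_grace = 10      # mark the MDS laggy after 10 seconds without a beacon (documented default: 15)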
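
And for the reboot case, the sequence David describes would be roughly the following, assuming systemd-managed daemons with unit names like ceph-mds@<id> and the short hostname as the MDS id (both are assumptions; adjust for your deployment):

    # on the node about to be rebooted: stop only the MDS first
    systemctl stop ceph-mds@$(hostname -s)

    # from another node: wait until `ceph mds stat` shows a standby has taken over
    watch ceph mds stat

    # only then reboot the node (taking its mon/mgr down with it)
    reboot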