Yes, definitely enable standby-replay. I saw sub-second failovers with
standby-replay, but when I restarted the new rank 0 (previously 0-s)
while the standby was syncing up to become 0-s, the failover took
several minutes. This was with ~30 GiB of cache.

On Fri, Jul 26, 2019 at 12:41 PM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> One particularly interesting point in setups with a large number of
> active files/caps is the failover.
>
> If your MDS fails (assuming a single MDS; multiple MDS with multiple
> active ranks behave the same way for _each_ rank), the monitors will
> detect the failure and update the MDS map. CephFS clients will be
> notified about the update and connect to the new MDS the rank has
> failed over to (hopefully within the connect timeout...). They will
> also re-request all their currently active caps from the MDS to allow
> it to recreate the state from the point in time before the failure.
>
> And this is where things can get "interesting". Assuming a cold
> standby MDS, the MDS will receive the information about all active
> files and the capabilities assigned to the various clients. It also
> has to _stat_ all these files during the rejoin phase. If millions of
> files have to be stat'ed, this may take time, put a lot of pressure
> on the metadata and data pools, and might even lead to timeouts and a
> subsequent failure or failover to another MDS.
>
> We had some problems with this in the past, but it became better and
> less failure-prone with every Ceph release (great work, Ceph
> developers!). Our current setup has up to 15 million cached inodes
> and several million caps in the worst case (during the nightly
> backup). The caps-per-client limit in Luminous/Nautilus(?) helps a
> lot with reducing the number of active files and caps.
>
> Prior to Nautilus we configured a secondary MDS as standby-replay,
> which allows it to cache the same inodes that were active on the
> primary.
> During rejoin, the stat calls can then be served from cache, which
> makes the failover a lot faster and less demanding for the Ceph
> cluster itself. In Nautilus the standby-replay setup has moved from a
> daemon feature to a filesystem feature (one spare MDS becomes the
> designated standby-replay for a rank). But there are also other
> caveats, like such a daemon not being selected as failover for
> another rank.
>
> So if you want to test CephFS for your use case, I would highly
> recommend testing failover, too: both a controlled failover and an
> unexpected one. You may also want to use multiple active MDS, but my
> experience with these setups is limited.
>
> Regards,
>
> Burkhard
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
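[Editor's note] A sketch of the pre-Nautilus per-daemon standby-replay setup Burkhard describes; the daemon name `mds.b` and the rank are illustrative placeholders:

```ini
# ceph.conf on the standby host (pre-Nautilus: standby-replay was a
# per-daemon setting, not a filesystem property)
[mds.b]
    # Continuously replay the active MDS's journal so its inodes stay
    # warm in this daemon's cache
    mds_standby_replay = true
    # Optionally pin this standby to a specific rank
    mds_standby_for_rank = 0
```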
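[Editor's note] In Nautilus and later, standby-replay is enabled at the filesystem level instead; a sketch, with the filesystem name `cephfs` as a placeholder:

```shell
# Nautilus+: designate a spare MDS as standby-replay for the
# filesystem's ranks (a filesystem property, not a daemon setting)
ceph fs set cephfs allow_standby_replay true

# Check which daemon has picked up the standby-replay role
ceph fs status cephfs
```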
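[Editor's note] The caps-per-client limit mentioned above and a controlled failover test might look like the following; the value 500000 is an illustrative choice, and `mds_max_caps_per_client` is the option I believe is being referred to:

```shell
# Cap the number of caps a single client may hold (default is on the
# order of 1M); fewer active caps means less stat work during rejoin
ceph config set mds mds_max_caps_per_client 500000

# Controlled failover test: fail rank 0 and watch the standby take over
ceph mds fail 0
ceph fs status
```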