Yes, definitely enable standby-replay. I saw sub-second failovers with
standby-replay, but when I restarted the new rank 0 (previously 0-s)
while the standby was syncing up to become 0-s, the failover took
several minutes. This was with ~30 GiB of cache.

On Fri, Jul 26, 2019 at 12:41 PM Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> One particularly interesting point in setups with a large number of
> active files/caps is the failover.
>
> If your MDS fails (assuming a single MDS; multiple MDS with multiple
> active ranks behave the same way for _each_ rank), the monitors will
> detect the failure and update the MDS map. CephFS clients will be
> notified about the update and connect to the new MDS the rank has
> failed over to (hopefully within the connect timeout...). They will
> also re-request all their currently active caps from the MDS to allow
> it to recreate the state from the point in time before the failure.
>
> And this is where things can get "interesting". Assuming a cold
> standby MDS, the MDS will receive the information about all active
> files and the capabilities assigned to the various clients. It also
> has to _stat_ all these files during the rejoin phase. If millions of
> files have to be stat'ed, this may take time, put a lot of pressure
> on the metadata and data pools, and might even lead to timeouts and a
> subsequent failure or failover to another MDS.
>
> We had some problems with this in the past, but it became better and
> less failure-prone with every Ceph release (great work, Ceph
> developers!). Our current setup has up to 15 million cached inodes
> and several million caps in the worst case (during the nightly
> backup). The caps-per-client limit in Luminous/Nautilus(?) helps a
> lot with reducing the number of active files and caps.
>
> Prior to Nautilus we configured a secondary MDS as standby-replay,
> which allows it to cache the same inodes that were active on the
> primary.
> During rejoin, the stat calls can then be served from cache, which
> makes the failover a lot faster and less demanding for the Ceph
> cluster itself. In Nautilus the standby-replay setup has moved from a
> daemon feature to a filesystem feature (one spare MDS becomes the
> designated standby-replay for a rank). But there are also other
> caveats, like such a daemon not being selected as failover for
> another rank.
>
> So if you want to test CephFS for your use case, I would highly
> recommend testing failover, too: both a controlled failover and an
> unexpected one. You may also want to use multiple active MDS, but my
> experience with these setups is limited.
>
> Regards,
>
> Burkhard
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
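[Editor's note] A sketch of the pre-Nautilus per-daemon standby-replay setup Burkhard describes; the daemon name `mds.b` and the rank are illustrative placeholders:

```ini
# ceph.conf on the standby host (pre-Nautilus: standby-replay was a
# per-daemon setting, not a filesystem property)
[mds.b]
    # Continuously replay the active MDS's journal so its inodes stay
    # warm in this daemon's cache
    mds_standby_replay = true
    # Optionally pin this standby to a specific rank
    mds_standby_for_rank = 0
```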
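[Editor's note] In Nautilus and later, standby-replay is enabled at the filesystem level instead; a sketch, with the filesystem name `cephfs` as a placeholder:

```shell
# Nautilus+: designate a spare MDS as standby-replay for the
# filesystem's ranks (a filesystem property, not a daemon setting)
ceph fs set cephfs allow_standby_replay true

# Check which daemon has picked up the standby-replay role
ceph fs status cephfs
```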
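[Editor's note] The caps-per-client limit mentioned above and a controlled failover test might look like the following; the value 500000 is an illustrative choice, and `mds_max_caps_per_client` is the option I believe is being referred to:

```shell
# Cap the number of caps a single client may hold (default is on the
# order of 1M); fewer active caps means less stat work during rejoin
ceph config set mds mds_max_caps_per_client 500000

# Controlled failover test: fail rank 0 and watch the standby take over
ceph mds fail 0
ceph fs status
```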