Re: MDS / CephFS behaviour with unusual directory layout

Hi,


One particularly interesting point in setups with a large number of active files/caps is failover.


If your MDS fails (assuming a single active MDS; setups with multiple active ranks behave the same way for _each_ rank), the monitors will detect the failure and update the MDS map. CephFS clients are notified of the update and connect to the new MDS the rank has failed over to (hopefully within the connect timeout...). They also re-request all their currently active caps from the new MDS so it can recreate the state from just before the failure.
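For reference, you can watch this happen with the standard ceph CLI; the recovering MDS walks through the replay, reconnect, rejoin and active states. The grace period below is the usual default; on releases without centralized config (pre-Mimic), set it in ceph.conf instead:

```shell
# Monitors mark an MDS failed after it misses beacons for
# mds_beacon_grace seconds (default 15); lower values mean
# faster detection but more risk of spurious failovers.
ceph config set mon mds_beacon_grace 15

# Watch the rank fail over and progress through the states
# replay -> reconnect -> rejoin -> active:
ceph fs status
ceph mds stat
```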


And this is where things can get "interesting". Assuming a cold standby MDS, the new daemon receives the information about all active files and the capabilities assigned to the various clients. It also has to _stat_ all of these files during the rejoin phase. If millions of files have to be stat'ed, this can take time, put a lot of pressure on the metadata and data pools, and may even lead to timeouts and a subsequent failure or failover to yet another MDS.


We had some problems with this in the past, but it became better and less failure-prone with every Ceph release (great work, Ceph developers!). Our current setup has up to 15 million cached inodes and, in the worst case (during the nightly backup), several million caps. The per-client caps limit (introduced in Luminous or Nautilus, I believe) helps a lot with reducing the number of active files and caps.
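For reference, the relevant option is mds_max_caps_per_client (the value below is the default); on releases without centralized config, set it in ceph.conf instead:

```shell
# Limit the number of caps a single client session may hold
# (default 1048576); the MDS asks clients holding more to
# release caps back down to the limit.
ceph config set mds mds_max_caps_per_client 1048576

# The MDS cache size indirectly bounds cached inodes as well:
ceph config set mds mds_cache_memory_limit 17179869184  # 16 GiB, example value
```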

Prior to Nautilus we configured a secondary MDS as standby-replay, which lets it cache the same inodes that are active on the primary. During rejoin the stat calls can then be served from cache, which makes the failover much faster and less demanding for the Ceph cluster itself. In Nautilus the standby-replay setup moved from a daemon-level option to a filesystem-level one (one spare MDS becomes the designated standby-replay daemon for a rank). But there are also caveats, e.g. such a daemon is not selected as failover for another rank.
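A rough sketch of both variants (the filesystem name "cephfs" and daemon name are examples):

```shell
# Pre-Nautilus: per-daemon settings in the standby's ceph.conf section:
#   [mds.standby-a]
#   mds_standby_replay   = true
#   mds_standby_for_rank = 0

# Nautilus and later: enabled per filesystem; one spare MDS per rank
# is then picked as its designated standby-replay daemon.
ceph fs set cephfs allow_standby_replay true
```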


So if you want to test CephFS for your use case, I would highly recommend testing failover, too: both a controlled failover and an unexpected one. You may also want to use multiple active MDS daemons, but my experience with those setups is limited.
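For such a test, something along these lines works (the filesystem name and systemd unit name are examples; adjust to your deployment):

```shell
# Controlled failover: tell the monitors to fail rank 0,
# the standby takes over.
ceph mds fail 0

# Unexpected failure: kill the active MDS daemon hard on its host.
systemctl kill -s SIGKILL ceph-mds@mds-host-a   # example unit name

# Multiple active ranks, if you want to test that as well:
ceph fs set cephfs max_mds 2
```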


Regards,

Burkhard


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


