Hi,
one particularly interesting point in setups with a large number of
active files/caps is failover behavior.
If your MDS fails (assuming a single active MDS; setups with multiple
active ranks behave the same way for _each_ rank), the monitors will
detect the failure and update the mds map. CephFS clients are notified
about the update and connect to the new MDS the rank has failed over to
(hopefully within the connect timeout...). They also re-request all of
their currently active caps from that MDS so it can recreate the state
as it was before the failure.
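A quick way to watch this from the operator side (a rough sketch, the
exact commands and defaults depend on your release; this assumes a
nautilus-era cluster with the centralized config database):

    # which daemon currently holds which rank, and the mds map state
    ceph mds stat
    ceph fs status
    # how long the monitors wait for missed beacons before they mark
    # an MDS as failed and promote a standby
    ceph config get mon mds_beacon_grace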
And this is where things can get "interesting". Assuming a cold standby
MDS, the new MDS will receive the information about all active files and
capabilities assigned to the various clients. It also has to _stat_ all
of these files during the rejoin phase. And if millions of files have to
be stat'ed, this may take time, put a lot of pressure on the metadata
and data pools, and might even lead to timeouts and a subsequent failure
or failover to yet another MDS.
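You can follow the phases the replacement MDS walks through (replay,
reconnect, rejoin, active; resolve only appears with multiple ranks)
while it happens; mds.<name> below is a placeholder for your daemon:

    # filesystem-wide view, shows the current state of each rank
    watch ceph fs status
    # per-daemon view via the admin socket, run on the MDS host
    ceph daemon mds.<name> status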
We had some problems with this in the past, but it has become better and
less failure-prone with every ceph release (great work, ceph
developers!). Our current setup has up to 15 million cached inodes and
several million caps in the worst case (during the nightly backup). The
caps-per-client limit (luminous/nautilus?) helps a lot with reducing the
number of active files and caps.
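If you want to play with that limit: if I remember correctly the option
is called mds_max_caps_per_client (check the documentation for your
release); the values below are just examples, not recommendations:

    # cap the number of caps a single client session may hold
    ceph config set mds mds_max_caps_per_client 500000
    # and keep the overall metadata cache bounded (bytes)
    ceph config set mds mds_cache_memory_limit 17179869184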
Prior to nautilus we configured a secondary MDS as standby-replay, which
allows it to cache the same inodes that are active on the primary.
During rejoin the stat calls can then be served from cache, which makes
the failover a lot faster and less demanding for the ceph cluster
itself. In nautilus the standby-replay setup has moved from a daemon
setting to a filesystem feature (one spare MDS becomes the designated
standby-replay for a rank). But there are also caveats, e.g. such a
daemon will not be selected as failover for another rank.
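Roughly, the two flavors look like this (names are placeholders, and the
pre-nautilus option names may differ slightly between releases):

    # pre-nautilus: ceph.conf section of the standby daemon
    [mds.b]
        mds_standby_replay = true
        mds_standby_for_rank = 0

    # nautilus and later: per-filesystem setting
    ceph fs set <fs_name> allow_standby_replay true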
So if you want to test cephfs for your use case, I would highly
recommend testing failover, too, both a controlled failover and an
unexpected one. You may also want to use multiple active MDS daemons,
but my experience with these setups is limited.
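For a controlled failover you can simply fail the active rank and watch
the standby take over; multiple active ranks are enabled per filesystem
(again, <fs_name> is a placeholder):

    # controlled failover: fail the daemon currently holding rank 0
    ceph mds fail 0
    # allow two active ranks; remember to test failover for each rank
    ceph fs set <fs_name> max_mds 2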
Regards,
Burkhard