Hi,
one particularly interesting point in setups with a large number of
active files/caps is failover behavior.
If your MDS fails (assuming a single active MDS; setups with multiple
active ranks behave the same way for _each_ rank), the monitors will
detect the failure and update the mds map. CephFS clients are notified
about the update and connect to the new MDS the rank has failed over to
(hopefully within the connect timeout...). They also re-request all of
their currently active caps from that MDS so it can recreate the state
as it was before the failure.
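A quick way to watch this from the operator side (a rough sketch, the
exact commands and defaults depend on your release; this assumes a
nautilus-era cluster with the centralized config database):

    # which daemon currently holds which rank, and the mds map state
    ceph mds stat
    ceph fs status
    # how long the monitors wait for missed beacons before they mark
    # an MDS as failed and promote a standby
    ceph config get mon mds_beacon_grace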
And this is where things can get "interesting". Assuming a cold standby
MDS, the new MDS will receive the information about all active files and
capabilities assigned to the various clients. It also has to _stat_ all
of these files during the rejoin phase. And if millions of files have to
be stat'ed, this may take time, put a lot of pressure on the metadata
and data pools, and might even lead to timeouts and a subsequent failure
or failover to yet another MDS.
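You can follow the phases the replacement MDS walks through (replay,
reconnect, rejoin, active; resolve only appears with multiple ranks)
while it happens; mds.<name> below is a placeholder for your daemon:

    # filesystem-wide view, shows the current state of each rank
    watch ceph fs status
    # per-daemon view via the admin socket, run on the MDS host
    ceph daemon mds.<name> status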
We had some problems with this in the past, but it has become better and
less failure-prone with every ceph release (great work, ceph
developers!). Our current setup has up to 15 million cached inodes and
several million caps in the worst case (during the nightly backup). The
caps-per-client limit (luminous/nautilus?) helps a lot with reducing the
number of active files and caps.
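If you want to play with that limit: if I remember correctly the option
is called mds_max_caps_per_client (check the documentation for your
release); the values below are just examples, not recommendations:

    # cap the number of caps a single client session may hold
    ceph config set mds mds_max_caps_per_client 500000
    # and keep the overall metadata cache bounded (bytes)
    ceph config set mds mds_cache_memory_limit 17179869184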
Prior to nautilus we configured a secondary MDS as standby-replay, which
allows it to cache the same inodes that are active on the primary.
During rejoin the stat calls can then be served from cache, which makes
the failover a lot faster and less demanding for the ceph cluster
itself. In nautilus the standby-replay setup has moved from a daemon
setting to a filesystem feature (one spare MDS becomes the designated
standby-replay for a rank). But there are also caveats, e.g. such a
daemon will not be selected as failover for another rank.
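Roughly, the two flavors look like this (names are placeholders, and the
pre-nautilus option names may differ slightly between releases):

    # pre-nautilus: ceph.conf section of the standby daemon
    [mds.b]
        mds_standby_replay = true
        mds_standby_for_rank = 0

    # nautilus and later: per-filesystem setting
    ceph fs set <fs_name> allow_standby_replay true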
So if you want to test cephfs for your use case, I would highly
recommend testing failover, too, both a controlled failover and an
unexpected one. You may also want to use multiple active MDS daemons,
but my experience with these setups is limited.
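For a controlled failover you can simply fail the active rank and watch
the standby take over; multiple active ranks are enabled per filesystem
(again, <fs_name> is a placeholder):

    # controlled failover: fail the daemon currently holding rank 0
    ceph mds fail 0
    # allow two active ranks; remember to test failover for each rank
    ceph fs set <fs_name> max_mds 2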
Regards,
Burkhard