MDS flapping: how to increase MDS timeouts?

Burkhard Linke <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> · Thu, 26 Jan 2017 09:18:56 +0100

Hi,

We are running two MDS servers in an active/standby-replay setup. Recently
we had to disconnect the active MDS server, and the failover to the standby
was triggered as expected.

However, the filesystem currently contains over 5 million files, and
reading all the metadata information from the data pool took too long
because the information was not in the OSD page caches. The mons timed
out the MDS and failed over to the former active MDS (which was
available as a standby again). That MDS in turn had to read the
metadata, ran into the same timeout, triggered another failover, and so
on. I resolved the situation by disabling one of the MDS instances,
which kept the mons from failing the only remaining MDS.
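
The loop described above can be sketched as a toy model. This is not Ceph
code: the function name, the 15-second grace period, and the 60-second
rejoin time are assumptions chosen for illustration only. It just encodes
the observed behaviour that the mons fail a slow MDS only while a standby
is available to take over.

```python
def simulate_failover(rejoin_seconds, grace_seconds, standby_available,
                      max_rounds=5):
    """Toy model of the flapping described above (hypothetical helper).

    Returns the number of failovers that happen before an MDS is left
    alone to finish rejoin, or None if the pair is still flapping after
    max_rounds attempts.
    """
    failovers = 0
    for _ in range(max_rounds):
        # The MDS survives if it finishes rejoin within the grace period,
        # or if there is no standby the mons could switch to.
        if rejoin_seconds <= grace_seconds or not standby_available:
            return failovers
        # Otherwise the mons fail it, and the other MDS starts rejoin
        # from scratch with the same cold caches.
        failovers += 1
    return None


if __name__ == "__main__":
    # Assumed numbers: rejoin needs 60 s, the mons wait 15 s.
    print(simulate_failover(60, 15, standby_available=True))
    print(simulate_failover(60, 15, standby_available=False))
```

In this model there are only two ways out of the loop: a grace period
longer than the rejoin time, or no standby to fail over to. The second is
what I did by hand.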

So given a large filesystem, how do I prevent failover flapping between 
MDS instances that are in the rejoin state and reading the inode 
information?

Regards,
Burkhard